Paper A v3.13: resolve Opus 4.7 round-12 + codex gpt-5.5 round-13 findings

Opus 4.7 max-effort round-12 on v3.12 found 1 MAJOR + 7 MINOR residues;
codex gpt-5.5 xhigh round-13 cross-verified the 11 resulting edits as
RESOLVED (11/11) and caught one additional cosine-P95 ambiguity that
Opus missed (methodology L255). Total: 12 text-only edits across 5
files.

MAJOR M1 - Cosine P95→P7.5 terminology residue at two sites that cite
the v3.12-corrected Section III-L but still wrote "P95" (self-
contradiction). Fix: methodology L165 and results L247 both restated
as "whole-sample Firm A P7.5 heuristic" with the 92.5%/7.5%
complement spelled out.

MINOR findings and fixes:
- m1 Big-4 scope slip: methodology III-H(b) L166 and results IV-H.2
  L311 said "every Big-4 auditor-year" but IV-H.2 ranking actually
  pools all 4,629 auditor-years across Big-4 and Non-Big-4. Both
  sites now say "every auditor-year ... across all firms."
- m2 178 vs 180 Firm A CPA breakdown: intro L54 and conclusion L21
  now add "of 180 registered CPAs; 178 after excluding two with
  disambiguation ties, Section IV-G.2" parenthetical to avoid the
  misleading 180−171=9 reading.
- m3 IV-H.1 A2 citation: results L286 now explicitly invokes the
  A2 within-year label-uniformity convention (Section III-G) when
  reading the left-tail share as a partner-level "minority of hand-
  signers."
- m4 IV-F L177 cross-ref / fold distinction: corrected Section III-H
  → Section III-L anchor, and added explicit note that the 0.95
  heuristic is a whole-sample anchor while Table XI thresholds are
  calibration-fold-derived (cosine P5 = 0.9407).
- m5 Table XVI (30,222) vs Table XVII (30,226) Firm A count gap:
  results L406 now explains the 4-report difference (XVI is restricted
  to single-firm two-signer reports in which both signers are Firm A;
  XVII counts reports with at least one Firm A signer under the
  84,386-document cohort).
- m6 Methodology L156 "four independent quantitative analyses"
  actually enumerated 6 items: rephrased as "three primary
  independent quantitative analyses plus a fourth strand comprising
  three complementary checks."
- m7 Abstract "cluster into three groups" restored the "smoothly-
  mixed" qualifier to match Discussion V-B and Conclusion L17.
- Codex-caught residue at methodology L255 ("Median, 1st percentile,
  and 95th percentile of signature-level cosine/dHash distributions")
  grammatically applied P95 to cosine too. Rewrote as
  "cosine median, P1, and P5 (lower-tail) and dHash_indep median
  and P95 (upper-tail)" matching Table XI L233 exactly.

No re-computation. All tables (IV-XVIII) and Appendix A numbers
unchanged. Abstract at 249/250 words after smoothly-mixed qualifier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 21:21:37 +08:00
parent 9b0b8358a2
commit ef0e417257
5 changed files with 12 additions and 12 deletions
+5 -5
@@ -174,7 +174,7 @@ Table IX reports the proportion of Firm A signatures crossing each candidate thr
All rates computed exactly from the full Firm A sample (N = 60,448 signatures); counts reproduce from `signature_analysis/24_validation_recalibration.py` (whole_firm_a section).
-->
-Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to Firm A via the calibration described in Section III-H.
+Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to the whole-sample Firm A distribution described in Section III-L (the 70/30 calibration-fold thresholds of Table XI are separate and slightly different, e.g., calibration-fold cosine P5 = 0.9407 rather than the whole-sample heuristic 0.95).
The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 89.95% of Firm A, a value that is consistent with both the accountant-level crossings (Section IV-E) and the 139/32 high-replication versus middle-band split within Firm A (Section IV-E).
Section IV-G reports the corresponding rates on the 30% Firm A hold-out fold, which provides the external check these whole-sample rates cannot.
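A minimal sketch of how a dual-rule capture rate of this kind could be computed, assuming each signature is summarized by its best-match cosine similarity and its dHash distance. The function name `dual_rule_rate`, the tuple layout, and the toy values are hypothetical; they are not taken from `signature_analysis/24_validation_recalibration.py`.

```python
from typing import Iterable, Tuple


def dual_rule_rate(
    signatures: Iterable[Tuple[float, int]],
    cos_cut: float = 0.95,
    dhash_cut: int = 8,
) -> float:
    """Share of signatures with cosine > cos_cut AND dHash <= dhash_cut."""
    sigs = list(signatures)
    if not sigs:
        raise ValueError("no signatures supplied")
    hits = sum(1 for cos, dhash in sigs if cos > cos_cut and dhash <= dhash_cut)
    return hits / len(sigs)


# Toy (cosine, dHash) pairs, invented for illustration only.
toy = [(0.99, 2), (0.97, 8), (0.96, 12), (0.95, 3), (0.90, 1)]
rate = dual_rule_rate(toy)  # only the first two pairs satisfy both conditions
```

The cosine condition is strict and the dHash condition is non-strict, following the rule as written; a signature sitting exactly at 0.95 is therefore not captured.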
@@ -244,7 +244,7 @@ We therefore interpret the held-out fold as confirming the qualitative finding (
### 3) Operational-Threshold Sensitivity: cos $> 0.95$ vs cos $> 0.945$
-The per-signature classifier (Section III-L) uses cos $> 0.95$ as its operational cosine cut, anchored on the whole-sample Firm A P95 heuristic.
+The per-signature classifier (Section III-L) uses cos $> 0.95$ as its operational cosine cut, anchored on the whole-sample Firm A P7.5 heuristic (i.e., 7.5% of whole-sample Firm A signatures lie at or below 0.95; see Section III-L).
The accountant-level convergent threshold analysis (Section IV-E) places the primary accountant-level reference between $0.973$ and $0.979$ (KDE antimode, Beta-2 crossing, logit-Gaussian robustness crossing), and the accountant-level 2D-GMM marginal at $0.945$.
Because the classifier operates at the signature level while these convergent accountant-level estimates are at the accountant level, they are formally non-substitutable.
We report a sensitivity check in which the classifier's operational cut cos $> 0.95$ is replaced by the nearest accountant-level reference, cos $> 0.945$.
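The sketch below illustrates, under stated assumptions, the two quantities this sensitivity check turns on: the share of a reference sample lying at or below a cut (which is what gives a cut a name such as "P7.5"), and the number of signatures whose label flips when the cut moves from 0.95 to 0.945. The helper names and the toy cosine values are hypothetical and do not reproduce the paper's data or pipeline.

```python
from typing import Sequence


def share_at_or_below(values: Sequence[float], cut: float) -> float:
    """Fraction of values <= cut, i.e. the percentile rank of the cut as a share."""
    return sum(1 for v in values if v <= cut) / len(values)


def flipped(values: Sequence[float], old_cut: float, new_cut: float) -> int:
    """Count values whose 'cos > cut' label changes between the two cuts."""
    lo, hi = sorted((old_cut, new_cut))
    return sum(1 for v in values if lo < v <= hi)


# Toy best-match cosines, invented for illustration only.
cosines = [0.99, 0.98, 0.97, 0.96, 0.95, 0.948, 0.946, 0.945, 0.93, 0.90]
tail_share = share_at_or_below(cosines, 0.95)
n_flipped = flipped(cosines, old_cut=0.95, new_cut=0.945)
```

In this reading, a cut is a "P7.5" anchor when `share_at_or_below` returns 0.075 on the reference sample, with the remaining 92.5% above the cut.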
@@ -283,7 +283,7 @@ Subsection H.3 applies the calibrated classifier and is therefore a consistency
### 1) Year-by-Year Stability of the Firm A Left Tail
Table XIII reports the proportion of Firm A signatures with per-signature best-match cosine below 0.95, disaggregated by fiscal year.
-Under the replication-dominated interpretation (Section III-H) this left-tail share captures the minority of Firm A partners who continue to hand-sign.
+Under the replication-dominated interpretation (Section III-H) and the within-year label-uniformity convention A2 (Section III-G), this left-tail share is read as a partner-level minority of Firm A CPAs who continue to hand-sign rather than as a bare signature-level rate.
Under the alternative hypothesis that the left tail is an artifact of scan or compression noise, the share should shrink as scanning and PDF-compression technology improved over 2013-2023.
<!-- TABLE XIII: Firm A Per-Year Cosine Distribution
@@ -308,7 +308,7 @@ This stability supports the replication-dominated framing: a persistent minority
### 2) Partner-Level Similarity Ranking
-If Firm A applies firm-wide stamping while the other Big-4 firms use stamping only for a subset of partners, Firm A auditor-years should disproportionately occupy the top of the similarity distribution among all Big-4 auditor-years.
+If Firm A applies firm-wide stamping while the other Big-4 firms use stamping only for a subset of partners, Firm A auditor-years should disproportionately occupy the top of the similarity distribution among all auditor-years (across all firms).
We test this prediction directly.
For each auditor-year (CPA $\times$ fiscal year) with at least 5 signatures we compute the mean best-match cosine similarity across the year's signatures, yielding 4,629 auditor-years across 2013-2023.
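As a hedged illustration of the aggregation just described, the following sketch groups signature-level best-match cosines by (CPA, fiscal year), keeps groups with at least 5 signatures, and ranks the resulting auditor-years by mean similarity. The row layout, identifiers, and values are assumptions made for the example and are not the paper's data or code.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, int]  # (CPA identifier, fiscal year)


def auditor_year_means(
    rows: Iterable[Tuple[str, int, float]],
    min_signatures: int = 5,
) -> Dict[Key, float]:
    """Mean best-match cosine per auditor-year with >= min_signatures signatures."""
    groups: Dict[Key, List[float]] = defaultdict(list)
    for cpa, year, cosine in rows:
        groups[(cpa, year)].append(cosine)
    return {k: mean(v) for k, v in groups.items() if len(v) >= min_signatures}


def rank_descending(means: Dict[Key, float]) -> List[Key]:
    """Auditor-years ordered from highest to lowest mean similarity."""
    return sorted(means, key=lambda k: means[k], reverse=True)


# Toy rows (cpa_id, fiscal_year, best-match cosine); all values are invented.
rows = (
    [("cpa_a", 2020, 0.99)] * 5
    + [("cpa_b", 2020, 0.96)] * 6
    + [("cpa_c", 2020, 0.90)] * 4  # dropped: fewer than 5 signatures
)
means = auditor_year_means(rows)
ranking = rank_descending(means)
```

Pooling every qualifying auditor-year into one ranking, regardless of firm, is what lets the share of any one firm in the top of the distribution be read off directly.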
@@ -403,7 +403,7 @@ A cosine-only classifier would treat all 71,656 identically; the dual-descriptor
96.9% of Firm A's documents fall into the high- or moderate-confidence non-hand-signed categories, 0.6% into high-style-consistency, and 2.5% into uncertain.
This pattern is consistent with the replication-dominated framing: the large majority is captured by non-hand-signed rules, while the small residual is consistent with the 32/171 middle-band minority identified by the accountant-level mixture (Section IV-E).
-The absence of any meaningful "likely hand-signed" rate (4 of 30,226 Firm A documents, 0.013%) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set.
+The absence of any meaningful "likely hand-signed" rate (4 of 30,226 Firm A documents, 0.013%; the 30,226 count here is documents with at least one Firm A signer under the 84,386-document classification cohort, which differs from the 30,222 single-firm two-signer subset in Table XVI by 4 reports) implies either that Firm A's minority hand-signers have not been captured in the lowest-cosine tail---for example, because they also exhibit high style consistency---or that their contribution is small enough to be absorbed into the uncertain category at this threshold set.
We note that because the non-hand-signed thresholds are themselves calibrated to Firm A's empirical percentiles (Section III-H), these rates are an internal consistency check rather than an external validation; the held-out Firm A validation of Section IV-G.2 is the corresponding external check.
### 2) Cross-Method Agreement