Apply Phase 5 round-6 narrative-consistency patches + audit artifact

Closes the four audit-surfaced concerns from paper/narrative_audit_v4.md plus the Opus round-2 N5 interpretive caveat. All five are prose-level consistency polishings; no empirical or structural changes. Concern A (Phase 4 line 31 / §I body): "Script 39c" provenance for the jittered-dHash claim was less precise than the §III line 59 source-of-truth which (post round-5) attributes the non-Big-4 jittered evidence to a codex-verified read-only spike. Updated §I to: "cosine: Script 39c; jittered-dHash: Script 39d for Big-4 plus codex-verified read-only spike for ten non-Big-4 firms." Concern B (Phase 4 line 81 / §V-B): same jittered-dHash claim without precise provenance. Updated §V-B to match Concern A attribution + §III-I.4 cross-reference. Concern C (§III-K.4 line 149): cross-reference to "v3.x §IV-I corpus-wide version" was stale after v4 §IV-I was shrunk to a reframing stub. Updated to "§III-L.1 (Big-4 v4 sample) and the inherited corpus-wide v3.x version cited at §IV-I". Concern D (Spearman precision): standardized §III-K.1 table at lines 125-127 to 4 decimal places (0.963/0.889/0.879 -> 0.9627/0.8890/0.8794), matching §IV-F Table IX. Prose floor language "rho >= 0.879" preserved across Abstract/§I/§V/§VI since 0.8794 still rounds to 0.879 at 3dp. Opus N5 / §V-H limit 2 nuance: added a sentence interpreting the firm-dependent within-firm violation - Firm A's per-firm ICCR is more contaminated by within-firm sharing than B/C/D's, so the B/C/D rates of 0.09-0.16 are closer to clean specificity, and the Firm A vs B/C/D contrast reflects both genuine heterogeneity AND a firm-dependent proxy-contamination gradient. Audit artifact paper/narrative_audit_v4.md (~200 lines) captures the full cross-section coherence check across Abstract / §I / §III / §IV / §V / §VI: - Abstract -> body mirror audit (12 claims, all aligned) - §I 8 contributions -> §III/§IV/§V/§VI mapping (all aligned) - v3->v4 pivot rhetoric thread (5 nodes, all aligned) - K=3 demotion / ICCR-FAR / numbers consistency: all verified - Splice-readiness gate: 10/12 pass + 2 splice-time mechanical Headline assessment: "Mostly Coherent - submission-ready after 2-3 small patches" (now applied). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 18:22:22 +08:00
parent e9357c903b
commit 8dddc3b87c
3 changed files with 227 additions and 7 deletions
@@ -122,9 +122,9 @@ Pairwise Spearman rank correlations among the three scores, $n = 437$ Big-4 CPAs

 | Pair | Spearman $\rho$ | $p$-value |
 |---|---|---|
-| Score 1 vs Score 3 | $+0.963$ | $< 10^{-248}$ |
-| Score 2 vs Score 3 | $+0.889$ | $< 10^{-149}$ |
-| Score 1 vs Score 2 | $+0.879$ | $< 10^{-142}$ |
+| Score 1 vs Score 3 | $+0.9627$ | $< 10^{-248}$ |
+| Score 2 vs Score 3 | $+0.8890$ | $< 10^{-149}$ |
+| Score 1 vs Score 2 | $+0.8794$ | $< 10^{-142}$ |

 We read this as the strongest internal-consistency signal in v4.0: three different summarisations of the same descriptor pair agree on the per-CPA descriptor-position ranking with $\rho > 0.87$. The three scores agree on placing Firm A as the most replication-dominated descriptor position and the three non-Firm-A Big-4 firms further from the templated end, but they do not all rank the non-Firm-A firms identically: the K=3 posterior P(C1) and the box-rule less-replication-dominated rate (Scores 1 and 3) place Firm C at the less-replication-dominated end of Big-4 (mean P(C1) $= 0.311$; mean box-rule less-replication-dominated rate $= 0.790$), while the reverse-anchor cosine percentile (Score 2) places Firm D fractionally higher than Firm C (mean reverse-anchor score $-0.7125$ vs Firm C $-0.7672$, with higher value indicating deeper into the reference left tail). The mean values for Firms B and D sit between Firms A and C on Scores 1 and 3 (Script 38 per-firm summary). We do not claim this constitutes external validation of any operational classifier; the inherited box rule is calibrated separately (§III-L), and the convergence above shows that a mixture-derived score and a reverse-anchor score concur with the box rule's per-CPA-aggregated outputs on the directional ordering, with a modest disagreement at the less-replication-dominated end between the three non-A Big-4 firms.

@@ -146,7 +146,7 @@ The Script 39 report verdict is `SIG_CONVERGENCE_MODERATE`. The $\kappa = 0.870$

 **4. Positive-anchor miss rate on byte-identical signatures (Script 40).** The corpus provides one hard ground-truth subset: signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are conservative-subset ground truth for the *replicated* class. The Big-4 byte-identical subset comprises $n = 262$ signatures ($145 / 8 / 107 / 2$ across Firms A through D; Script 40).

-We report each candidate check's *positive-anchor miss rate* — the fraction of byte-identical signatures classified as belonging to the less-replication-dominated descriptor positions. This is a one-sided check against a conservative positive subset, **not a paired specificity metric in the usual two-class sense**; we do not report a paired negative-anchor metric here because no signature-level hand-signed ground truth exists. The corresponding signature-level inter-CPA negative-anchor ICCR evidence is developed in §III-L.1 (Big-4 sample) and the v3.x §IV-I corpus-wide version (reported under prior "FAR" terminology):
+We report each candidate check's *positive-anchor miss rate* — the fraction of byte-identical signatures classified as belonging to the less-replication-dominated descriptor positions. This is a one-sided check against a conservative positive subset, **not a paired specificity metric in the usual two-class sense**; we do not report a paired negative-anchor metric here because no signature-level hand-signed ground truth exists. The corresponding signature-level inter-CPA negative-anchor ICCR evidence is developed in §III-L.1 (Big-4 v4 sample) and the inherited corpus-wide v3.x version cited at §IV-I (reported under prior "FAR" terminology):

 | Candidate check | Pixel-identity miss rate (Wilson 95% CI) |
 |---|---|
@@ -28,7 +28,7 @@ Despite the significance of the problem for audit quality and regulatory oversig

 In this paper we present a fully automated, end-to-end pipeline for detecting non-hand-signed CPA signatures in audit reports at scale, together with a multi-tool validation framework that explicitly discloses the unsupervised setting's limits. The pipeline processes raw PDF documents through (1) signature page identification with a Vision-Language Model; (2) signature region detection with a trained YOLOv11 object detector; (3) deep feature extraction via a pre-trained ResNet-50; (4) dual-descriptor similarity (cosine + independent-minimum dHash); (5) anchor-based threshold calibration at three units of analysis (per-comparison, pool-normalised per-signature, per-document) against an inter-CPA negative-anchor coincidence-rate proxy (§III-L); (6) firm-stratified per-rule reporting and a within-firm cross-CPA hit-matrix analysis (§III-L.4); (7) a composition decomposition that establishes the absence of a within-population bimodal antimode in the descriptor distributions (§III-I.4); and (8) a multi-tool unsupervised validation strategy with disclosed assumption-violation analysis (§III-M).

-The methodological reframing relative to earlier versions of this work is central to our v4.0 contribution. Earlier work in this lineage adopted a distributional path to thresholds — fitting accountant-level finite-mixture models and treating their marginal crossings as data-derived "natural" thresholds. v4.0 reports a composition decomposition diagnostic (§III-I.4) that overturns this reading: the apparent multimodality of the Big-4 accountant-level distribution is fully explained by between-firm location-shift effects (Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$) and integer mass-point artefacts on the integer-valued dHash axis. Once both confounds are removed (firm-mean centring plus uniform integer jitter), the Big-4 pooled dHash dip test yields $p_{\text{median}} = 0.35$ across five jitter seeds, eliminating the rejection. Within-firm signature-level cosine and jittered-dHash dip tests fail to reject in every individual Big-4 firm and in every individual mid/small firm with $\geq 500$ signatures (10 firms tested in Script 39c). The descriptor distributions therefore contain no within-population bimodal antimode that could anchor an operational threshold.
+The methodological reframing relative to earlier versions of this work is central to our v4.0 contribution. Earlier work in this lineage adopted a distributional path to thresholds — fitting accountant-level finite-mixture models and treating their marginal crossings as data-derived "natural" thresholds. v4.0 reports a composition decomposition diagnostic (§III-I.4) that overturns this reading: the apparent multimodality of the Big-4 accountant-level distribution is fully explained by between-firm location-shift effects (Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$) and integer mass-point artefacts on the integer-valued dHash axis. Once both confounds are removed (firm-mean centring plus uniform integer jitter), the Big-4 pooled dHash dip test yields $p_{\text{median}} = 0.35$ across five jitter seeds, eliminating the rejection. Within-firm signature-level cosine dip tests fail to reject in every individual Big-4 firm and in every individual mid/small firm with $\geq 500$ signatures (10 firms tested in Script 39c), and the corresponding within-firm jittered-dHash dip tests likewise fail to reject in all four Big-4 firms (Script 39d) and across a codex-verified read-only spike on the same ten mid/small firms ($0/10$ reject; §III-I.4). The descriptor distributions therefore contain no within-population bimodal antimode that could anchor an operational threshold.

 In place of distributional anchoring, v4.0 adopts an anchor-based inter-CPA coincidence-rate (ICCR) calibration. At the per-comparison unit, the inherited cos$>0.95$ operating point yields ICCR $= 0.00060$ on a $5 \times 10^5$-pair Big-4 sample (replicating v3.x's reported per-comparison rate of $0.0005$ under prior "FAR" terminology); the dHash$\leq 5$ structural cutoff yields ICCR $= 0.00129$ (v4 new); the joint rule cos$>0.95$ AND dHash$\leq 5$ yields joint ICCR $= 0.00014$ (any-pair semantics, matching the deployed extrema rule). At the pool-normalised per-signature unit, the same rule's effective coincidence rate is materially higher because the deployed classifier takes max-cosine and min-dHash over a same-CPA pool: pooled Big-4 any-pair ICCR is $0.1102$ (Wilson 95% CI $[0.1086, 0.1118]$; CPA-block bootstrap 95% $[0.0908, 0.1330]$). At the per-document unit, the operational HC$+$MC alarm fires on $33.75\%$ of Big-4 documents under the inter-CPA candidate-pool counterfactual.

@@ -78,7 +78,7 @@ Non-hand-signing differs from forgery in that the questioned signature is produc

 A central empirical finding of v3.x was that *per-signature* similarity does not admit a clean two-mechanism mixture: dip-test fails to reject unimodality at the signature level for Firm A, BIC prefers a 3-component fit, and BD/McCrary candidate transitions lie inside the high-similarity mode rather than between modes. v4.0 strengthens and extends this signature-level reading.

-The Big-4 accountant-level descriptor distribution does reject unimodality on both marginals at $p < 5 \times 10^{-4}$ (Script 34). v4.0's composition decomposition (§III-I.4; Scripts 39b–39e) shows that this rejection is fully attributable to two non-mechanistic sources: (a) between-firm location-shift effects on both axes — Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$ creates a multi-peaked pooled distribution that any single firm's distribution lacks — and (b) integer mass-point artefacts on the integer-valued dHash axis, which inflate the dip statistic against a continuous-density null. A 2×2 factorial diagnostic applied to the Big-4 pooled dHash (firm-mean centring × uniform integer jitter $[-0.5, +0.5]$, 5 jitter seeds) shows that the dip test fails to reject ($p_{\text{median}} = 0.35$, 0/5 seeds reject) when *both* corrections are applied; either correction alone leaves the rejection in place. Within-firm signature-level cosine and jittered-dHash dip tests fail to reject in every individual Big-4 firm and in every individual non-Big-4 firm with $\geq 500$ signatures (10 firms tested). The descriptor distributions therefore lack a within-population bimodal antimode that could anchor an operational threshold. The K=2 / K=3 mixture fits are retained in §III-J as descriptive partitions of the joint Big-4 distribution that reflect firm-compositional structure, not as inferential evidence for two or three latent mechanism modes.
+The Big-4 accountant-level descriptor distribution does reject unimodality on both marginals at $p < 5 \times 10^{-4}$ (Script 34). v4.0's composition decomposition (§III-I.4; Scripts 39b–39e) shows that this rejection is fully attributable to two non-mechanistic sources: (a) between-firm location-shift effects on both axes — Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$ creates a multi-peaked pooled distribution that any single firm's distribution lacks — and (b) integer mass-point artefacts on the integer-valued dHash axis, which inflate the dip statistic against a continuous-density null. A 2×2 factorial diagnostic applied to the Big-4 pooled dHash (firm-mean centring × uniform integer jitter $[-0.5, +0.5]$, 5 jitter seeds) shows that the dip test fails to reject ($p_{\text{median}} = 0.35$, 0/5 seeds reject) when *both* corrections are applied; either correction alone leaves the rejection in place. Within-firm signature-level cosine and jittered-dHash dip tests fail to reject in every individual Big-4 firm and in every individual non-Big-4 firm with $\geq 500$ signatures tested (cosine: Scripts 39b/39c; jittered-dHash: Script 39d for Big-4 plus codex-verified read-only spike for the ten non-Big-4 firms; see §III-I.4). The descriptor distributions therefore lack a within-population bimodal antimode that could anchor an operational threshold. The K=2 / K=3 mixture fits are retained in §III-J as descriptive partitions of the joint Big-4 distribution that reflect firm-compositional structure, not as inferential evidence for two or three latent mechanism modes.

 ## C. Firm A as the Templated End of Big-4 (Case Study, Not Calibration Anchor)

@@ -112,7 +112,7 @@ Several limitations should be transparent. The first nine are v4.0-specific; the

 *No signature-level ground truth; no true error rates reportable.* The corpus does not contain labelled hand-signed or replicated classes at the signature level. We therefore cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, precision, or positive predictive value against ground truth. All quantitative rates reported in §III-L are inter-CPA negative-anchor coincidence rates (ICCRs) under the assumption that inter-CPA pairs constitute a clean negative anchor; this is a specificity proxy, not a calibrated specificity (§III-M).

-*Inter-CPA negative-anchor assumption is partially violated.* The cross-firm hit matrix of §III-L.4 shows that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse. The inter-CPA-as-negative assumption is therefore not exactly satisfied — some inter-CPA pairs may share firm-level templates rather than being independent random matches. Our reported per-comparison ICCRs are best read as specificity-proxy rates under a partially-violated assumption, not as calibrated FARs.
+*Inter-CPA negative-anchor assumption is partially violated and the violation is firm-dependent.* The cross-firm hit matrix of §III-L.4 shows that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse. The inter-CPA-as-negative assumption is therefore not exactly satisfied — some inter-CPA pairs may share firm-level templates rather than being independent random matches. Our reported per-comparison ICCRs are best read as specificity-proxy rates under a partially-violated assumption, not as calibrated FARs. Because the violation is firm-dependent, Firm A's per-firm ICCR is more contaminated by within-firm sharing than Firms B/C/D's; the per-firm B/C/D rates of $0.09$–$0.16$ are therefore closer to a clean specificity estimate than the pooled rate, and the Firm A vs Firms B/C/D contrast reflects both genuine firm heterogeneity and a firm-dependent proxy-contamination gradient.

 *Scope.* The v4.0 primary analyses are scoped to the Big-4 sub-corpus. We did not perform the full per-signature pool-normalised ICCR analysis at the full $n = 686$ scope; the §IV-K full-dataset Spearman re-run shows the K=3 $+$ box-rule rank-convergence is preserved at $n = 686$ but does not validate the Big-4 operational ICCRs, the LOOO firm-fold structure, or the five-way operational classifier at the broader scope.