Apply Phase 5 round-6 narrative-consistency patches + audit artifact

Closes the four audit-surfaced concerns from paper/narrative_audit_v4.md plus the Opus round-2 N5 interpretive caveat. All five are prose-level consistency polishings; no empirical or structural changes. Concern A (Phase 4 line 31 / §I body): "Script 39c" provenance for the jittered-dHash claim was less precise than the §III line 59 source-of-truth which (post round-5) attributes the non-Big-4 jittered evidence to a codex-verified read-only spike. Updated §I to: "cosine: Script 39c; jittered-dHash: Script 39d for Big-4 plus codex-verified read-only spike for ten non-Big-4 firms." Concern B (Phase 4 line 81 / §V-B): same jittered-dHash claim without precise provenance. Updated §V-B to match Concern A attribution + §III-I.4 cross-reference. Concern C (§III-K.4 line 149): cross-reference to "v3.x §IV-I corpus-wide version" was stale after v4 §IV-I was shrunk to a reframing stub. Updated to "§III-L.1 (Big-4 v4 sample) and the inherited corpus-wide v3.x version cited at §IV-I". Concern D (Spearman precision): standardized §III-K.1 table at lines 125-127 to 4 decimal places (0.963/0.889/0.879 -> 0.9627/0.8890/0.8794), matching §IV-F Table IX. Prose floor language "rho >= 0.879" preserved across Abstract/§I/§V/§VI since 0.8794 still rounds to 0.879 at 3dp. Opus N5 / §V-H limit 2 nuance: added a sentence interpreting the firm-dependent within-firm violation - Firm A's per-firm ICCR is more contaminated by within-firm sharing than B/C/D's, so the B/C/D rates of 0.09-0.16 are closer to clean specificity, and the Firm A vs B/C/D contrast reflects both genuine heterogeneity AND a firm-dependent proxy-contamination gradient. Audit artifact paper/narrative_audit_v4.md (~200 lines) captures the full cross-section coherence check across Abstract / §I / §III / §IV / §V / §VI: - Abstract -> body mirror audit (12 claims, all aligned) - §I 8 contributions -> §III/§IV/§V/§VI mapping (all aligned) - v3->v4 pivot rhetoric thread (5 nodes, all aligned) - K=3 demotion / ICCR-FAR / numbers consistency: all verified - Splice-readiness gate: 10/12 pass + 2 splice-time mechanical Headline assessment: "Mostly Coherent - submission-ready after 2-3 small patches" (now applied). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 18:22:22 +08:00
parent e9357c903b
commit 8dddc3b87c
3 changed files with 227 additions and 7 deletions
@@ -0,0 +1,220 @@
 # Paper A v4.0 — Narrative-Thread Audit
 Auditor: Claude Opus 4.7 (1M context)
 Date: 2026-05-14
 Target: paper/v4/paper_a_prose_v4_phase4.md + paper/v4/paper_a_methodology_v4_section_iii.md + paper/v4/paper_a_results_v4_section_iv.md (post round-5, commit 128a914)
 Purpose: Coherence check across Abstract / §I / §III / §IV / §V / §VI as a single argument, after Phase 5 AI peer-review panel convergence (3/3 in Accept/Minor band).
 ## Headline assessment
 **Mostly Coherent — submission-ready after 2-3 small narrative-consistency patches.**
 The v4 story arc — *"v3.20.0 distributional path turns out to be composition + integer artefact; v4.0 replaces it with anchor-based ICCR + decisive firm heterogeneity; positioning is anchor-calibrated specificity-only screening, not validated detector"* — reads cleanly from Abstract through §VI. The 5 fix rounds + 3 reviewer panels have substantially closed the major framing, terminology, and provenance risks. What remains is narrow narrative-consistency residue between Phase 4 §I/§V prose and §III source-of-truth (3 specific items, all small), plus 1 interpretive caveat that would strengthen §V-H limitation 2.
 No empirical reruns required. No structural rewrites required. Submission-readiness gate: **conditionally pass** — recommend a 15-minute round-6 prose-consistency patch before manuscript-splice, then the manuscript is splice-ready.
 ## 1. Abstract → body mirror audit
 The Abstract (Phase 4 line 11, 247 words) makes ~12 distinct claims. Each maps cleanly to a body location:
 | # | Abstract claim | §I location | §III/§IV body location | Status |
 |---:|---|---|---|---|
 | 1 | Non-hand-signed detection problem (regulation + digitization) | §I paras 1–3 (lines 19–21) | — (problem framing) | **Aligned** |
 | 2 | Pipeline: VLM + YOLOv11 + ResNet-50 + dual-descriptor | §I para 6 (line 29) + contribution 2 | §III-A..F (inherited) + §III-F dual-descriptor | **Aligned** |
 | 3 | 90,282 reports / 182,328 sigs / 758 CPAs | §I para 8 (line 39) | §IV-A..C (inherited) | **Aligned** |
 | 4 | Big-4 sub-corpus: 437 CPAs / 150,442 sigs | §I para 8 (line 39) | §III-G (line 19), §IV-D (line 9, line 15) | **Aligned** |
 | 5 | Composition decomposition $p_{\text{median}} = 0.35$ | §I para 5 (line 31) + contribution 4 | §III-I.4 (lines 55–73), §IV-M.1 Table XX (line 266) | **Aligned** |
 | 6 | Per-comparison ICCRs 0.0006 / 0.0013 / 0.00014 | §I para 6 (line 33) + contribution 5 | §III-L.1 (line 196), §IV-M.2 Table XXI (line 280) | **Aligned** |
 | 7 | Per-signature ICCR 0.11 | §I para 6 (line 33) | §III-L.2 (line 208), §IV-M.3 Table XXII (line 300) | **Aligned** |
 | 8 | Per-document ICCR 0.34 (HC+MC) | §I para 6 (line 33) | §III-L.3 (line 233), §IV-M.4 Table XXIII (line 317) | **Aligned** |
 | 9 | Firm heterogeneity: Firm A 0.62 vs B/C/D 0.09–0.16 | §I para 7 (line 35) + contribution 6 | §III-L.4 (line 259), §IV-M.4 (line 325) | **Aligned** |
 | 10 | Within-firm 77–99% (any-pair) | §I para 7 (line 35) + contribution 6 | §III-L.4 (line 283), §IV-M.5 Table XXV (line 340) | **Aligned** |
 | 11 | "Specificity-proxy-anchored screening + HITL, not validated detector" positioning | §I contribution 8 (line 57) | §III-M Table XXVII (line 316) + §V-G/H + §VI item 8 | **Aligned** |
 | 12 | "No calibrated error rates without ground truth" disclaimer | §I para 5 item (v) (line 25) | §III-M (line 312), §V-H limit 1 (line 113) | **Aligned** |
 Observation: Abstract does NOT mention the three-score Spearman convergence ($\rho \geq 0.879$). This is by design — the v4 pivot demoted three-score from a headline finding to "internal consistency" because the scores share inputs. §I contribution 7 and §V-E retain it with the demoted caveat. **No action needed.**
 ## 2. §I contributions (8) → body implementation map
 | # | §I contribution | §III/§IV implementation | §V/§VI loop-back | Status |
 |---:|---|---|---|---|
 | 1 | Problem formulation | — (§V-A discusses) | §V-A (line 73), §VI implicit | **Aligned** |
 | 2 | End-to-end pipeline | §III-A..F (inherited) | §VI line 145 | **Aligned** |
 | 3 | Dual-descriptor verification | §III-F + §IV-L backbone ablation | §V-A/B implicit | **Aligned** |
 | 4 | Composition decomposition | §III-I.4 + §IV-M.1 Table XX | §V-B (line 81), §VI item 1 (line 147) | **Aligned** |
 | 5 | Anchor-based multi-level ICCR | §III-L + §IV-M.2/M.3/M.4 Tables XXI/XXII/XXIII | §V-F (line 99), §VI item 2 | **Aligned** |
 | 6 | Firm heterogeneity + within-firm collision | §III-L.4 + §IV-M.4/M.5 Tables XXIV/XXV | §V-C (line 83), §VI items 3+4 | **Aligned** |
 | 7 | K=3 descriptive + three-score convergence | §III-J + §III-K.1 + §IV-E/F/G | §V-D/E (lines 89–97), §VI items 5+6 | **Aligned** |
 | 8 | Annotation-free positive-anchor + ten-tool ceiling | §III-K.4 + §III-M Table XXVII + §IV-H Table XIV | §V-G (line 105), §VI items 7+8 | **Aligned** |
 All 8 contributions trace cleanly through §III/§IV implementation and §V/§VI loop-back. **No action needed.**
 ## 3. v3→v4 pivot rhetoric thread
 The v4 pivot has four narrative nodes; each must reinforce the others:
 | Node | Location | Says |
 |---|---|---|
 | **Setup**: v3.x distributional path | §I para 5 line 31 (Phase 4 prose) | "Earlier work...adopted a distributional path...v4.0 reports a composition decomposition diagnostic that overturns this reading" |
 | **Proof**: 2×2 factorial composition decomposition | §III-I.4 Scripts 39b–39e (lines 55–73) | Joint firm-mean centring + integer-tie jitter eliminates rejection ($p_{\text{median}} = 0.35$) |
 | **Alternative**: anchor-based ICCR | §III-L (line 173+) | Replaces distributional thresholds with inter-CPA coincidence-rate calibration at 3 units |
 | **Discussion**: K=3 stays descriptive | §V-B + §V-D (lines 77–93) | Mixture fits are firm-compositional partitions, not mechanism modes |
 | **Conclusion**: pivot summary | §VI items 1+5 (line 147) | Demotes K=3 mechanism reading; positions ICCR as the operational calibration |
 All five nodes use consistent language ("composition + integer artefact"; "descriptive firm-compositional partition"; "no within-population bimodal antimode"). **No action needed.**
 ## 4. K=3 demotion consistency
 Verified across 4 locations using consistent descriptor-position language (post round-2 M1 fix):
 - §III-J line 90 (source of truth): "The 'descriptive position' column replaces v3.x's 'hand-leaning / mixed / replicated' mechanism labels"
 - §I contribution 7 (line 55): "K=3 mixture demoted from 'three mechanism clusters' to a descriptive firm-compositional partition"
 - §V-D (line 93): "the K=3 stability supports a descriptive reading...*not* a three-mechanism latent-class structure"
 - §VI item 5 (line 147): same demotion language
 - §IV Tables XVI/XVII column headers: "C1 (low-cos / high-dHash) | C2 (central) | C3 (high-cos / low-dHash)" — descriptor-position labels throughout
 `grep -n "hand-leaning"` in v4 public prose: 0 hits (only internal-strip text). **Closed.**
 ## 5. ICCR vs FAR terminology consistency
 Verified across all rate-reporting locations (post round-2 + round-5):
 - Abstract: "inter-CPA coincidence-rate (ICCR)" ✓
 - §I contribution 5 (line 51): explicit terminology adoption and FAR disclaimer ✓
 - §III-L.1 (line 185): "Terminological note on 'FAR'" with full disclaimer ✓
 - §IV-I (line 159): historical "FAR" cited only with the "v3.x terminology" caveat ✓
 - §V-G heading (line 105): "Inherited Inter-CPA Negative Anchor Reframed as Coincidence Rate" ✓
 - §V-H limitations: specificity-proxy framing under partially-violated assumption ✓
 - §VI item 2 (line 147): "explicit terminological replacement of 'FAR' by 'ICCR' given the unsupervised setting" ✓
 No public-prose "FAR" leak outside the historical-context caveats. **Closed.**
 ## 6. Numbers consistency audit (cross-section)
 Headline numbers cross-referenced between Abstract / §I / §III / §IV:
 | Claim | Abstract | §I body | §III source | §IV table | Status |
 |---|---|---|---|---|---|
 | Per-comparison ICCR cos $>0.95$ | 0.0006 | 0.0006 | 0.00060 | 0.00060 Table XXI | **Match** (rounding consistent) |
 | Per-comparison ICCR dHash $\leq 5$ | 0.0013 | 0.0013 | 0.00129 | 0.00129 Table XXI | **Match** |
 | Per-comparison joint | 0.00014 | 0.00014 | 0.00014 | 0.00014 Table XXI | **Match** |
 | Per-signature ICCR | 0.11 | 0.11 | 0.1102 (Wilson) | 0.1102 Table XXII | **Match** |
 | Per-document ICCR (HC+MC) | 0.34 | 0.34 | 0.3375 | 0.3375 Table XXIII | **Match** |
 | Firm A doc HC+MC | 0.62 | 0.62 | 0.6201 | 0.6201 §IV-M.4 line 325 | **Match** |
 | Firms B/C/D doc HC+MC | 0.09–0.16 | 0.09–0.16 | 0.1600 / 0.1635 / 0.0863 | same | **Match** |
 | Within-firm any-pair | 77–99% (rounded) | 76.7–83.7% / 98.8% | same | Table XXV | **Match** |
 | Same-pair within-firm | — | 97.0–99.96% | 99.96 / 97.7 / 98.2 / 97.0 | line 349 | **Match** |
 | Composition $p_{\text{median}}$ | 0.35 | 0.35 | 0.35 | 0.35 Table XX | **Match** |
 | Logistic OR | — | 0.053 / 0.010 / 0.027 | same | Table XXIV | **Match** |
 | Spearman ρ floor | — | 0.879 | 0.879 | 0.8794 Table IX | **Match** (Spearman precision §III/§IV differ at 4th decimal — see §8) |
 All headline numbers reconcile across sections. **Closed.**
 ## 7. Limitations vs §III-M Table XXVII coverage
 §V-H lists 14 limitations (9 v4-specific + 5 inherited from v3.20.0). Each Table XXVII assumption should be covered by an explicit §V-H item OR be self-evidently descriptive:
 | Table XXVII tool | Assumption | §V-H coverage |
 |---|---|---|
 | Composition decomposition | Jitter unbiased; Big-4 jittered + centred + jittered evidence | Implicit — covered by general "no signature-level ground truth" frame |
 | Per-comparison ICCR | Inter-CPA pairs are negative anchor (partially violated) | §V-H limit 2 (explicit) |
 | Per-signature ICCR | Same + pool replacement preserves negative-anchor property | §V-H limit 2 (implicit via "specificity-proxy rates under partially-violated assumption") |
 | Per-document ICCR | Same | Same |
 | Firm-heterogeneity logistic | Cluster-robust SE not run | **Gap** — no §V-H item explicitly flags the naive-SE caveat |
 | Cross-firm hit matrix | Deployed-rule semantics + mode-of-firms tie-break | §V-H limit 2 |
 | Alert-rate sensitivity | Descriptive gradient, not formal plateau | §V-H limit 5 (line 121, "alert-rate sensitivity analysis characterises only the HC threshold") |
 | Three-score Spearman | Scores share inputs | §V-H limit 6 (line 123, deployed-rate-excess interpretation) — partial; not the score-independence caveat directly |
 | Pixel-identical positive capture | Tautological (byte-identical ⇒ in HC region) | §V-H limit 4 (line 119, "pixel-identity is a conservative subset") |
 | LOOO firm-level reproducibility | Stability ≠ classification validity; K=3 membership ±12.8 pp | §V-H limit 8 (line 127, "K=3 hard-posterior membership is composition-sensitive") |
 **Gap 1**: §V-H does not explicitly flag the logistic-regression naive-SE caveat that Table XXVII row 5 discloses. Worth adding a half-sentence to §V-H or letting Table XXVII carry the disclosure since it's already in print.
 **Gap 2** (Opus N5 from round-2 audit): §V-H limit 2 discloses the firm-dependent within-firm violation numerically (98.8% at A; 76.7–83.7% at B/C/D) but does not interpret what this means for proxy reliability — namely that Firm A's per-firm ICCR is MORE contaminated by within-firm sharing than B/C/D's, so the per-firm B/C/D rates are closer to clean specificity. This nuance affects interpretation of the headline "firm heterogeneity is decisive" framing.
 ## 8. Net-new narrative concerns (audit-surfaced)
 ### Concern A — Phase 4 §I body line 31 cites "Script 39c" for jittered-dHash claim
 **Issue.** Phase 4 prose line 31 says:
 > "Within-firm signature-level cosine and jittered-dHash dip tests fail to reject in every individual Big-4 firm and in every individual mid/small firm with $\geq 500$ signatures (10 firms tested in Script 39c)."
 Codex round-9 verified that Script 39c on RAW dHash actually REJECTS unimodality in all 10 firms; only the JITTERED variant (codex's read-only spike on the Script 39c substrate) fails to reject. Round-5 corrected §III line 59 + provenance table line 382 to cite the spike attribution, but Phase 4 §I line 31's bare "Script 39c" citation now reads less precisely than the source-of-truth at §III line 59.
 **Severity.** Low. The qualitative claim ("fail to reject in 10 mid/small firms") is correct per codex's own rerun. The provenance attribution is what's slightly off.
 **Recommended fix.** Update Phase 4 line 31 to match §III line 59: "...in every individual mid/small firm with $\geq 500$ signatures (10 firms tested; cosine: Script 39c per-firm; jittered-dHash: codex-verified read-only spike on Script 39c substrate)." Or simpler: drop the parenthetical "in Script 39c" and let §III carry the precise provenance.
 ### Concern B — Phase 4 §V-B line 81 carries the same jittered-dHash claim without provenance
 **Issue.** §V-B (line 81) says:
 > "Within-firm signature-level cosine and jittered-dHash dip tests fail to reject in every individual Big-4 firm and in every individual non-Big-4 firm with $\geq 500$ signatures (10 firms tested)."
 No script citation here; the bare "10 firms tested" is technically OK but less precise than §III line 59 after round-5.
 **Severity.** Low.
 **Recommended fix.** Add a §III cross-reference: "...10 firms tested; see §III-I.4 / §III provenance table line for the codex-verified read-only spike." Or leave as-is and let §III line 59 carry the detailed provenance.
 ### Concern C — §III-K.4 line 149 stale cross-reference to v3.x §IV-I
 **Issue.** §III-K item 4 line 149 says:
 > "The corresponding signature-level inter-CPA negative-anchor ICCR evidence is developed in §III-L.1 (Big-4 sample) and the v3.x §IV-I corpus-wide version (reported under prior 'FAR' terminology)"
 After v4 §IV-I was shrunk to a 3-paragraph reframing stub (post round-3), the phrase "v3.x §IV-I corpus-wide version" is misleading — v4 §IV-I now exists, just as a pointer. Opus round-2 N6 flagged this; codex round-9 did not.
 **Severity.** Cosmetic.
 **Recommended fix.** Update line 149 to "§III-L.1 (Big-4 v4 sample) and the inherited corpus-wide v3.x version cited at §IV-I (reported under prior 'FAR' terminology)".
 ### Concern D — Spearman precision mismatch §III vs §IV
 **Issue.** §III-K.1 line 123–127 reports Spearman ρ as 0.963 / 0.889 / 0.879 (3 decimal places); §IV-F Table IX line 81–87 reports 0.9627 / 0.8890 / 0.8794 (4 decimal places). Codex round-8 flagged this as OPEN / COPY-EDIT.
 **Severity.** Cosmetic.
 **Recommended fix.** Standardise on 4 decimal places (matches Script 38 reported precision) in §III + §IV + §V-E + §VI.
 ## 9. Splice-readiness gate
 | Item | Status | Notes |
 |---|---|---|
 | Abstract word count | ✓ | 247 / 250 |
 | §I contributions count | ✓ | 8 contributions, all map to body |
 | §II LOOO addition with refs [42]-[44] | ✓ | Present (post round-1) |
 | §III sub-sections G..M complete | ✓ | Including Table XXVII numbered |
 | §IV table sequence V–XXVI sequential | ✓ | Post round-2 cascade |
 | §V sub-sections A..H complete | ✓ | Post round-2 M4 fix (G→H) |
 | §VI items 1..8 map to §I 1..8 | ✓ | 1:1 mapping verified |
 | References [1]–[44] present | ✓ | 44 entries; [42]-[44] = Stone 1974 / Geisser 1975 / Vehtari 2017 |
 | Internal draft notes stripped | ✗ | **Splice-time mechanical** — Phase 4 line 3 + lines 153-162; §III line 3 + lines 434-447; §IV line 3 + line 365+ |
 | "Nine-tool" / "Table XV-B" residue | ✗ | **Splice-time mechanical** — only in internal-strip text |
 | Cross-section number consistency | ✓ | All headline numbers match across Abstract / §I / §III / §IV |
 | Terminology consistency (ICCR / K=3 / less-replication-dominated) | ✓ | No public-prose leaks |
 | IEEE Access format | ✓ | Abstract single-paragraph ≤250 words; numbered references; numbered tables |
 ## 10. Recommended round-6 narrative-consistency patch (15 min, optional)
 Before manuscript-splice, three small text patches would close the audit-surfaced concerns:
 1. **Concern A**: Phase 4 line 31 — narrow the "Script 39c" provenance attribution for the jittered-dHash claim to match §III line 59.
 2. **Concern C**: §III line 149 — update the "v3.x §IV-I corpus-wide version" wording to reflect v4 §IV-I's reframing-stub status.
 3. **Concern D**: Standardise Spearman precision to 4 decimal places across §III/§IV/§V.
 Optional:
 - **Gap 2 / Opus N5**: Add a half-sentence to §V-H limit 2 interpreting the firm-dependent within-firm violation as "Firm A's per-firm ICCR is more contaminated by within-firm sharing than Firms B/C/D's, so the per-firm B/C/D rates are closer to clean specificity than the pooled rate."
 None of these is empirical or structural. They are prose-level consistency polishings. **Submission can proceed without them**, but they would strengthen reviewer-pass robustness.
 ## 11. Submission-readiness verdict
 **Conditionally ready.**
 The empirical core is sound and reproducible. The Phase 5 panel converged 3/3 in Accept/Minor band. Five fix rounds have closed every reviewer-flagged finding. Numbers are consistent across sections. Terminology is consistent. The v4 pivot's narrative thread reads as a coherent argument from Abstract through Conclusion.
 **Path A (ship now)**: proceed directly to manuscript-splice → DOCX export → partner Jimmy review → submission. Risk: the four audit-surfaced concerns above may be flagged by external reviewers as small cosmetic issues. Cost: zero pre-submission work.
 **Path B (15-min polish first)**: apply round-6 patches for the four concerns above → re-verify with `rg`-based grep → manuscript-splice → DOCX → partner. Risk: zero. Cost: 15 min.
 **Recommendation**: Path B. The four concerns are small, narrative-consistency only, and fixing them avoids putting cross-section attribution inconsistencies in front of partner Jimmy / IEEE Access reviewers.
 After Path B (or A): proceed to manuscript-splice as the next mechanical step.
@@ -122,9 +122,9 @@ Pairwise Spearman rank correlations among the three scores, $n = 437$ Big-4 CPAs
 | Pair | Spearman $\rho$ | $p$-value |
 |---|---|---|
-| Score 1 vs Score 3 | $+0.963$ | $< 10^{-248}$ |
+| Score 1 vs Score 3 | $+0.9627$ | $< 10^{-248}$ |
-| Score 2 vs Score 3 | $+0.889$ | $< 10^{-149}$ |
+| Score 2 vs Score 3 | $+0.8890$ | $< 10^{-149}$ |
-| Score 1 vs Score 2 | $+0.879$ | $< 10^{-142}$ |
+| Score 1 vs Score 2 | $+0.8794$ | $< 10^{-142}$ |
 We read this as the strongest internal-consistency signal in v4.0: three different summarisations of the same descriptor pair agree on the per-CPA descriptor-position ranking with $\rho > 0.87$. The three scores agree on placing Firm A as the most replication-dominated descriptor position and the three non-Firm-A Big-4 firms further from the templated end, but they do not all rank the non-Firm-A firms identically: the K=3 posterior P(C1) and the box-rule less-replication-dominated rate (Scores 1 and 3) place Firm C at the less-replication-dominated end of Big-4 (mean P(C1) $= 0.311$; mean box-rule less-replication-dominated rate $= 0.790$), while the reverse-anchor cosine percentile (Score 2) places Firm D fractionally higher than Firm C (mean reverse-anchor score $-0.7125$ vs Firm C $-0.7672$, with higher value indicating deeper into the reference left tail). The mean values for Firms B and D sit between Firms A and C on Scores 1 and 3 (Script 38 per-firm summary). We do not claim this constitutes external validation of any operational classifier; the inherited box rule is calibrated separately (§III-L), and the convergence above shows that a mixture-derived score and a reverse-anchor score concur with the box rule's per-CPA-aggregated outputs on the directional ordering, with a modest disagreement at the less-replication-dominated end between the three non-A Big-4 firms.
@@ -146,7 +146,7 @@ The Script 39 report verdict is `SIG_CONVERGENCE_MODERATE`. The $\kappa = 0.870$
 **4. Positive-anchor miss rate on byte-identical signatures (Script 40).** The corpus provides one hard ground-truth subset: signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are conservative-subset ground truth for the *replicated* class. The Big-4 byte-identical subset comprises $n = 262$ signatures ($145 / 8 / 107 / 2$ across Firms A through D; Script 40).
-We report each candidate check's *positive-anchor miss rate* — the fraction of byte-identical signatures classified as belonging to the less-replication-dominated descriptor positions. This is a one-sided check against a conservative positive subset, **not a paired specificity metric in the usual two-class sense**; we do not report a paired negative-anchor metric here because no signature-level hand-signed ground truth exists. The corresponding signature-level inter-CPA negative-anchor ICCR evidence is developed in §III-L.1 (Big-4 sample) and the v3.x §IV-I corpus-wide version (reported under prior "FAR" terminology):
+We report each candidate check's *positive-anchor miss rate* — the fraction of byte-identical signatures classified as belonging to the less-replication-dominated descriptor positions. This is a one-sided check against a conservative positive subset, **not a paired specificity metric in the usual two-class sense**; we do not report a paired negative-anchor metric here because no signature-level hand-signed ground truth exists. The corresponding signature-level inter-CPA negative-anchor ICCR evidence is developed in §III-L.1 (Big-4 v4 sample) and the inherited corpus-wide v3.x version cited at §IV-I (reported under prior "FAR" terminology):
 | Candidate check | Pixel-identity miss rate (Wilson 95% CI) |
 |---|---|
@@ -28,7 +28,7 @@ Despite the significance of the problem for audit quality and regulatory oversig
 In this paper we present a fully automated, end-to-end pipeline for detecting non-hand-signed CPA signatures in audit reports at scale, together with a multi-tool validation framework that explicitly discloses the unsupervised setting's limits. The pipeline processes raw PDF documents through (1) signature page identification with a Vision-Language Model; (2) signature region detection with a trained YOLOv11 object detector; (3) deep feature extraction via a pre-trained ResNet-50; (4) dual-descriptor similarity (cosine + independent-minimum dHash); (5) anchor-based threshold calibration at three units of analysis (per-comparison, pool-normalised per-signature, per-document) against an inter-CPA negative-anchor coincidence-rate proxy (§III-L); (6) firm-stratified per-rule reporting and a within-firm cross-CPA hit-matrix analysis (§III-L.4); (7) a composition decomposition that establishes the absence of a within-population bimodal antimode in the descriptor distributions (§III-I.4); and (8) a multi-tool unsupervised validation strategy with disclosed assumption-violation analysis (§III-M).
-The methodological reframing relative to earlier versions of this work is central to our v4.0 contribution. Earlier work in this lineage adopted a distributional path to thresholds — fitting accountant-level finite-mixture models and treating their marginal crossings as data-derived "natural" thresholds. v4.0 reports a composition decomposition diagnostic (§III-I.4) that overturns this reading: the apparent multimodality of the Big-4 accountant-level distribution is fully explained by between-firm location-shift effects (Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$) and integer mass-point artefacts on the integer-valued dHash axis. Once both confounds are removed (firm-mean centring plus uniform integer jitter), the Big-4 pooled dHash dip test yields $p_{\text{median}} = 0.35$ across five jitter seeds, eliminating the rejection. Within-firm signature-level cosine and jittered-dHash dip tests fail to reject in every individual Big-4 firm and in every individual mid/small firm with $\geq 500$ signatures (10 firms tested in Script 39c). The descriptor distributions therefore contain no within-population bimodal antimode that could anchor an operational threshold.
+The methodological reframing relative to earlier versions of this work is central to our v4.0 contribution. Earlier work in this lineage adopted a distributional path to thresholds — fitting accountant-level finite-mixture models and treating their marginal crossings as data-derived "natural" thresholds. v4.0 reports a composition decomposition diagnostic (§III-I.4) that overturns this reading: the apparent multimodality of the Big-4 accountant-level distribution is fully explained by between-firm location-shift effects (Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$) and integer mass-point artefacts on the integer-valued dHash axis. Once both confounds are removed (firm-mean centring plus uniform integer jitter), the Big-4 pooled dHash dip test yields $p_{\text{median}} = 0.35$ across five jitter seeds, eliminating the rejection. Within-firm signature-level cosine dip tests fail to reject in every individual Big-4 firm and in every individual mid/small firm with $\geq 500$ signatures (10 firms tested in Script 39c), and the corresponding within-firm jittered-dHash dip tests likewise fail to reject in all four Big-4 firms (Script 39d) and across a codex-verified read-only spike on the same ten mid/small firms ($0/10$ reject; §III-I.4). The descriptor distributions therefore contain no within-population bimodal antimode that could anchor an operational threshold.
 In place of distributional anchoring, v4.0 adopts an anchor-based inter-CPA coincidence-rate (ICCR) calibration. At the per-comparison unit, the inherited cos$>0.95$ operating point yields ICCR $= 0.00060$ on a $5 \times 10^5$-pair Big-4 sample (replicating v3.x's reported per-comparison rate of $0.0005$ under prior "FAR" terminology); the dHash$\leq 5$ structural cutoff yields ICCR $= 0.00129$ (v4 new); the joint rule cos$>0.95$ AND dHash$\leq 5$ yields joint ICCR $= 0.00014$ (any-pair semantics, matching the deployed extrema rule). At the pool-normalised per-signature unit, the same rule's effective coincidence rate is materially higher because the deployed classifier takes max-cosine and min-dHash over a same-CPA pool: pooled Big-4 any-pair ICCR is $0.1102$ (Wilson 95% CI $[0.1086, 0.1118]$; CPA-block bootstrap 95% $[0.0908, 0.1330]$). At the per-document unit, the operational HC$+$MC alarm fires on $33.75\%$ of Big-4 documents under the inter-CPA candidate-pool counterfactual.
@@ -78,7 +78,7 @@ Non-hand-signing differs from forgery in that the questioned signature is produc
 A central empirical finding of v3.x was that *per-signature* similarity does not admit a clean two-mechanism mixture: dip-test fails to reject unimodality at the signature level for Firm A, BIC prefers a 3-component fit, and BD/McCrary candidate transitions lie inside the high-similarity mode rather than between modes. v4.0 strengthens and extends this signature-level reading.
-The Big-4 accountant-level descriptor distribution does reject unimodality on both marginals at $p < 5 \times 10^{-4}$ (Script 34). v4.0's composition decomposition (§III-I.4; Scripts 39b–39e) shows that this rejection is fully attributable to two non-mechanistic sources: (a) between-firm location-shift effects on both axes — Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$ creates a multi-peaked pooled distribution that any single firm's distribution lacks — and (b) integer mass-point artefacts on the integer-valued dHash axis, which inflate the dip statistic against a continuous-density null. A 2×2 factorial diagnostic applied to the Big-4 pooled dHash (firm-mean centring × uniform integer jitter $[-0.5, +0.5]$, 5 jitter seeds) shows that the dip test fails to reject ($p_{\text{median}} = 0.35$, 0/5 seeds reject) when *both* corrections are applied; either correction alone leaves the rejection in place. Within-firm signature-level cosine and jittered-dHash dip tests fail to reject in every individual Big-4 firm and in every individual non-Big-4 firm with $\geq 500$ signatures (10 firms tested). The descriptor distributions therefore lack a within-population bimodal antimode that could anchor an operational threshold. The K=2 / K=3 mixture fits are retained in §III-J as descriptive partitions of the joint Big-4 distribution that reflect firm-compositional structure, not as inferential evidence for two or three latent mechanism modes.
+The Big-4 accountant-level descriptor distribution does reject unimodality on both marginals at $p < 5 \times 10^{-4}$ (Script 34). v4.0's composition decomposition (§III-I.4; Scripts 39b–39e) shows that this rejection is fully attributable to two non-mechanistic sources: (a) between-firm location-shift effects on both axes — Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$ creates a multi-peaked pooled distribution that any single firm's distribution lacks — and (b) integer mass-point artefacts on the integer-valued dHash axis, which inflate the dip statistic against a continuous-density null. A 2×2 factorial diagnostic applied to the Big-4 pooled dHash (firm-mean centring × uniform integer jitter $[-0.5, +0.5]$, 5 jitter seeds) shows that the dip test fails to reject ($p_{\text{median}} = 0.35$, 0/5 seeds reject) when *both* corrections are applied; either correction alone leaves the rejection in place. Within-firm signature-level cosine and jittered-dHash dip tests fail to reject in every individual Big-4 firm and in every individual non-Big-4 firm with $\geq 500$ signatures tested (cosine: Scripts 39b/39c; jittered-dHash: Script 39d for Big-4 plus codex-verified read-only spike for the ten non-Big-4 firms; see §III-I.4). The descriptor distributions therefore lack a within-population bimodal antimode that could anchor an operational threshold. The K=2 / K=3 mixture fits are retained in §III-J as descriptive partitions of the joint Big-4 distribution that reflect firm-compositional structure, not as inferential evidence for two or three latent mechanism modes.
 ## C. Firm A as the Templated End of Big-4 (Case Study, Not Calibration Anchor)
@@ -112,7 +112,7 @@ Several limitations should be transparent. The first nine are v4.0-specific; the
 *No signature-level ground truth; no true error rates reportable.* The corpus does not contain labelled hand-signed or replicated classes at the signature level. We therefore cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, precision, or positive predictive value against ground truth. All quantitative rates reported in §III-L are inter-CPA negative-anchor coincidence rates (ICCRs) under the assumption that inter-CPA pairs constitute a clean negative anchor; this is a specificity proxy, not a calibrated specificity (§III-M).
-*Inter-CPA negative-anchor assumption is partially violated.* The cross-firm hit matrix of §III-L.4 shows that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse. The inter-CPA-as-negative assumption is therefore not exactly satisfied — some inter-CPA pairs may share firm-level templates rather than being independent random matches. Our reported per-comparison ICCRs are best read as specificity-proxy rates under a partially-violated assumption, not as calibrated FARs.
+*Inter-CPA negative-anchor assumption is partially violated and the violation is firm-dependent.* The cross-firm hit matrix of §III-L.4 shows that under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$–$99.96\%$ within-firm across all four firms), consistent with firm-specific template, stamp, or document-production reuse. The inter-CPA-as-negative assumption is therefore not exactly satisfied — some inter-CPA pairs may share firm-level templates rather than being independent random matches. Our reported per-comparison ICCRs are best read as specificity-proxy rates under a partially-violated assumption, not as calibrated FARs. Because the violation is firm-dependent, Firm A's per-firm ICCR is more contaminated by within-firm sharing than Firms B/C/D's; the per-firm B/C/D rates of $0.09$–$0.16$ are therefore closer to a clean specificity estimate than the pooled rate, and the Firm A vs Firms B/C/D contrast reflects both genuine firm heterogeneity and a firm-dependent proxy-contamination gradient.
 *Scope.* The v4.0 primary analyses are scoped to the Big-4 sub-corpus. We did not perform the full per-signature pool-normalised ICCR analysis at the full $n = 686$ scope; the §IV-K full-dataset Spearman re-run shows the K=3 $+$ box-rule rank-convergence is preserved at $n = 686$ but does not validate the Big-4 operational ICCRs, the LOOO firm-fold structure, or the five-way operational classifier at the broader scope.