Paper A v13 rev9.1: HC-meaning + same-pair table + interview/framing rebalance, plus typesetting polish

Respond to a second hostile GPT-5.5 reviewer pass on rev9. Four substantive changes plus accumulated typesetting polish.

Reviewer points addressed:
- HC != reuse (Fatal 1): new Sec III-F "What HC Means and Does Not Mean" states plainly that HC denotes an extreme within-accountant repetition pattern that is rare between unrelated accountants, not a reuse label; reuse is one interpretation, carried at Firm A by byte-identity + context, never implied by HC alone; no reuse claim is made for Firms B/C/D.
- Any-pair construction (Fatal 2): new Table VI gives the per-signature HC flag rate by firm under the deployed any-pair rule vs the strict same-pair rule (cosine and dHash from the same partner). Same-pair lowers all rates but widens the firm gap: Firm A 57.3% vs baseline 5-9%, ratio 2.4-3.4x -> 6.4-10.8x, so the HC region is not an artefact of combining extrema from different pairs. Reproducible via samepair_hc.py (Hamming on stored dHash vectors).
- Interviews (Fatal 3): Sec III-A now states the interviews are used only to contextualize, are corroborative not confirmatory and not independently reproducible; their one load-bearing use (Firm A as known-positive benchmark) lowers rather than raises the claim. Empirical claims rest on calibration + byte-identity, which stand without them.
- Framing (Fatal 4, rebalance not relabel): contribution 3 elevated to the methodological core (label-free construction/characterization of an operating point without labels), explicitly demonstrated/stress-tested on audit signatures "rather than a finished, fully general framework." The audit finding is kept as a headline result, not demoted to a mere case study, and no general-framework claim is made.

Typesetting polish (verified by rendering pages to images):
- Unify scientific notation in Table II ([4x10^-6, 2.3x10^-5]).
- Tighten Table II row labels to cut excessive wrapping (3 lines -> 2).
- Fix duplicated figure captions (empty image alt-text so pandoc no longer auto-captions on top of the hand-written caption); unify caption punctuation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Qn59FdF9JMyfFg3sjcUNNG
2026-06-23 15:37:13 +08:00
parent cb38d413ad
commit 2a13f0d985
3 changed files with 56 additions and 14 deletions
@@ -30,7 +30,7 @@ We make four contributions:

 1. An end-to-end screening pipeline that turns raw audit-report PDFs into operational risk strata for hundreds of thousands of signatures.
 2. A dual descriptor that separates style consistency from image reproduction — a distinction a single similarity measure blurs.
-3. A label-free, anchor-calibrated operating point that is both a method and a concrete, reusable rule. With neither a natural cutoff in the data nor labels to learn one from, we set a tunable rule by measuring how often it fires by chance in a clean reference group, and we say plainly what that measure can and cannot support. The result is not only a calibration method but a concrete operating point — the high-confidence rule and its measured specificity proxy — that practitioners working with comparable Chinese-signature image pipelines can use as a starting reference (not transplant unchanged, since the proxy is conditional on a similar preprocessing and reference-group setup), together with a defined disposition path for the ambiguous middle (calibrated demotion of the low-specificity band, aggregation to the accountant level, byte-identity escalation, and a bounded manual protocol) that keeps human review at the scale of exceptions.
+3. The methodological core: a label-free way to *construct and characterize* a screening operating point when no signature-level labels exist, which is the problem this paper is organized around. With neither a natural cutoff in the data nor labels to learn one from, we set a tunable rule by measuring how often it fires by chance in a clean reference group, and we say plainly what that measure can and cannot support. This is the part we expect to transfer beyond the present setting; we demonstrate and stress-test it at scale on audit signatures rather than claiming it as a finished, fully general framework. Concretely, the result is not only a calibration method but also a specific operating point (the high-confidence rule and its measured specificity proxy) that practitioners working with comparable Chinese-signature image pipelines can use as a starting reference (not transplant unchanged, since the proxy is conditional on a similar preprocessing and reference-group setup), together with a defined disposition path for the ambiguous middle (calibrated demotion of the low-specificity band, aggregation to the accountant level, byte-identity escalation, and a bounded manual protocol) that keeps human review at the scale of exceptions.
 4. A demonstration on Chinese signatures, a structurally complex and comparatively under-served script for signature analysis. Because our descriptors work on the image rather than on script-specific strokes, the approach does not depend on Latin-script assumptions and is a candidate for other scripts.

 The paper is organized to move from the problem to the evidence. Section II reviews related work and states the gap. Section III describes the study design — the data split, the pipeline, the five-way rule, and the calibration logic — and explains why each piece is built the way it is. Section IV reports the results: the calibration baseline, which category needs human review, and the held-out benchmark on Firm A. Section V collects supporting analyses, including the diagnostic showing that no natural cutoff exists. Section VI concludes.
@@ -57,7 +57,7 @@ This section explains how the study is built and why. We report no computed numb

 ### A. Institutional Background

-To pin down the signing practices that we need in order to interpret the results, we held semi-structured interviews with certifying partners and signing-system staff at all four firms in the study.¹ Three points do real work later. First, all four firms allow handwritten signing but none require it. Second, formal firm-wide electronic signing or sealing systems were adopted on staggered dates from 2020 onward. Third, one firm — which we call Firm A throughout — has used scanned-image overlay stamping as its usual practice since at least 2013. We use these facts only as background, not as labels for individual signatures: they guide how we split the data below and how we read the firm-level results in Section IV-C, but they do not tell us the status of any single signature. A further caution applies to how the interviews are used as corroboration. They are self-reported, anonymized, and not independently reproducible, so when the screen's firm-level output agrees with them (Section IV-C) that agreement is evidence of consistency with domain knowledge, not a measurement of the screen's accuracy or recall — quantifying those would require signature-level labels, which the archive does not provide. The practical implication is that the years before the formal systems (before 2020) are the right "normal" period to use for calibration.
+To pin down the signing practices that we need in order to interpret the results, we held semi-structured interviews with certifying partners and signing-system staff at all four firms in the study.¹ Three points do real work later. First, all four firms allow handwritten signing but none require it. Second, formal firm-wide electronic signing or sealing systems were adopted on staggered dates from 2020 onward. Third, one firm — which we call Firm A throughout — has used scanned-image overlay stamping as its usual practice since at least 2013. We use these facts only as background, not as labels for individual signatures: they guide how we split the data below and how we read the firm-level results in Section IV-C, but they do not tell us the status of any single signature. A further caution applies to how the interviews are used as corroboration. They are self-reported, anonymized, and not independently reproducible, so when the screen's firm-level output agrees with them (Section IV-C) that agreement is evidence of consistency with domain knowledge, not a measurement of the screen's accuracy or recall — quantifying those would require signature-level labels, which the archive does not provide. To be unambiguous about their role: the interviews are used only to *contextualize* the firm-level findings and are not treated as validation. They are corroborative, not confirmatory, and not independently reproducible; the empirical claims of this paper rest on the calibration and the byte-identical evidence, which stand without them. Their one load-bearing use is to motivate why Firm A is read as a known-positive benchmark rather than a blinded test (Section IV-C) — a framing that, if anything, lowers the evidentiary status we claim for that firm. The practical implication is that the years before the formal systems (before 2020) are the right "normal" period to use for calibration.

 > ¹ Footnote — institutional detail. The interviews were conducted under institutional research-ethics approval and are reported in anonymized, aggregated form; firms are labeled A–D and no individual can be identified. The formal systems were reported to have been adopted at roughly one firm in early 2020, one in 2021, and one in late 2022 (exact firm-level dates are withheld for anonymity; see supplementary materials). Interviewees attributed this timing partly to the COVID-19 pandemic, which forced remote review and signing, and to firm-wide paperless and environmental (ESG) initiatives — both of which accelerated the move to formal electronic signing at Firms B/C/D. For Firm A, the reported workflow is that the certifying accountant approves the finished report electronically, after which the print room overlays the accountant's stored seal or signature image onto the PDF and prints it; the stored image is rarely changed, and although handwritten signing is allowed it is reported to be very rare, and rarer over time. Before the formal systems, the other firms' practice varied: some used informal scan- or photocopy-based stamping alongside handwritten signing, and at least one reported mostly handwritten signing before its system. The property the calibration relies on (Section III-E) is that, in the pre-2020 baseline firms, different accountants did not share a common template — not that every signature was handwritten.

@@ -71,7 +71,7 @@ The corpus is all retrievable Taiwan statutory audit reports for fiscal years 20

 We explain the reason for each part in Section III-E. The key idea is simple: we calibrate only on the clean cell — the non-Firm-A firms in the years before formal systems — and test everything else against it. No numbers appear here; the calibration results start in Section IV-A.

-![Figure 1](figures/fig1.png)
+![](figures/fig1.png)

 *Figure 1. The data split. Rows are Firms A–D; columns are 2013–2019 and 2020–2023. The B/C/D × 2013–2019 cells are the clean calibration group; Firm A (both periods) is held-out benchmark 1 (a known positive); B/C/D × 2020–2023 is the secondary held-out test. We calibrate only on the clean cell and test everything else against it.*

@@ -89,7 +89,7 @@ Assigning each signature to an accountant. Each signature is matched to a regist

 (Detection accuracy, signature counts, match rates, and the resulting analysis sample are reported in Section IV-A.)

-![Figure 2](figures/fig2.png)
+![](figures/fig2.png)

 *Figure 2. The screening pipeline. A raw PDF passes through page-finding (a vision-language model), signature detection (YOLOv11) with red-stamp removal, feature extraction (ResNet-50), the two per-signature similarities (cosine for style; the smallest dHash to the same accountant for structure), and a five-way label.*

@@ -139,6 +139,10 @@ We report the rule's chance rate at three levels, because the rule takes the bes

 One further assumption deserves to be stated rather than buried, because it concerns how the clean group was chosen. The floor is *conditional on the reference group actually being clean* — it is a coincidence rate among accountants we take to be independent hand-signers, and the group (non-Firm-A firms, pre-2020) was selected partly because its rates are low and its practices, by the interviews, are not stamping-dominated. That selection is mild but not innocent: if some baseline accountants in fact reuse images undetected, the reference is contaminated. The direction of that error, however, is reassuring for the Firm-A contrast. Undetected reuse inside the baseline would only *raise* the between-accountant coincidence floor, which makes Firm A's gap above it *smaller*, not larger — so contamination of the clean group biases the headline contrast conservatively, against our conclusion rather than toward it. Two pieces of evidence bound the concern empirically. First, the three baseline firms are mutually consistent and uniformly low (Firms B and C within about 3.5× of each other, none close to Firm A; Section IV-A), so the floor does not hinge on any single firm and a leave-one-baseline-firm-out reading does not move it materially. Second, the one data-derived threshold, the low cosine cut, is stable when the group composition is changed — 0.8547 on the calibration cell, 0.8302 with the non-Big-4 firms folded in, a shift of at most 0.025 (Section V-C) — so widening or narrowing the reference at its boundary does not move the operating point. We therefore treat the clean-group assumption as a stated limitation with a known-safe error direction, not as a hidden premise.

+### F. What HC Means and Does Not Mean
+
+One sentence prevents the most common misreading of everything that follows. *HC is not a reuse label.* HC denotes an extreme *within-accountant repetition pattern* — a signature whose closest match among the same accountant's own signatures is both stylistically near-identical (cosine > 0.95) and structurally near-identical (dHash ≤ 5) — that is *statistically rare between unrelated accountants*, by the ICCR calibration of Section III-E. That is the whole of what the rule, on its own, establishes. Reuse of a stored image is one interpretation of an HC pattern, and the most economical one, but the rule does not imply it: a very steady hand, a fixed scanning-and-assembly pipeline, or a uniform house style can each raise within-accountant repetition (Section V-A, Section V-B). Where we go further than "extreme repetition" — as we do for Firm A — the additional weight comes from outside the rule: byte-identical signatures, which independent hand-signing cannot produce, and the institutional context, neither of which is implied by HC alone. For Firms B/C/D we make no reuse claim at all; their HC signatures are reported as a within-accountant repetition rate, not as detected reuse. Read this way, HC is a calibrated, reproducible screening category, and "reuse" is a conclusion that has to be earned separately — firm by firm, or signature by signature — rather than read off the label.
+
 ## IV. Findings

 This section reports the numbers. It starts with the calibration baseline (Firms B/C/D, 2013–2019), then says which category needs human review, then presents the held-out benchmark on Firm A.
@@ -161,7 +165,7 @@ Detection and the analysis sample (whole corpus). Two scopes appear in this sect

 The calibrated operating point: the four cut values and their bases. The five-way rule of Section III-D uses four cut values; we state them here because two are read directly from this study's data. The low cosine cut, 0.8547, is the crossover of the same-accountant and different-accountant cosine distributions computed on the calibration cell alone (Firms B/C/D, 2013–2019, closed-world: both the source signatures and their comparison set drawn from that cell; Section IV-C). We use this closed-world value as the primary cut rather than the corpus-wide crossover, so that the one data-derived threshold in the rule is estimated only on the calibration-only Firms-B/C/D 2013–2019 cell, held out from Firm A and from post-2020 scoring. The cut is stable across scopes — 0.8547 (calibration closed-world), 0.8367 corpus-wide, 0.8489 on the all-period baseline firms, 0.8302 with the non-Big-4 firms added; it moves by at most 0.025 across all four scopes (0.018 from the corpus-wide value), so the choice of scope is immaterial and the broader-scope values stand as robustness checks (Section V-C). The high cosine cut, 0.95, is the high-similarity operating point: it sits in the region where genuine reuse concentrates — the byte-identical anchor (Section IV-C) lies at cosine 1 — and a recalibration cannot move it onto a distributional antimode because none exists (no within-population bimodality, Section V-A). The near-identical structural cut, dHash ≤ 5, is the perceptual-hash distance below which two rasters are pixel-equivalent up to mild recompression, and dHash ≤ 15 bounds the looser "structurally similar" band; both follow the standard 64-bit dHash distance scale [27]. 
We therefore do not re-derive these three as optimal cutoffs but characterize their chance-of-firing behavior directly (the full prior-calibration provenance is in the supplementary materials), and we make them operator-tunable in one direction: their specificity proxy at these values is read off the chance-rate calibration below, and an operator can tighten the floor by inverting the ICCR curve (for example, dHash ≤ 3). This is a conservativeness dial, not a precision–recall control: tightening raises the specificity proxy and lowers the flag count, but there is no observable recall to trade back, so loosening cannot be calibrated against a known cost. We deliver these as a concrete, calibrated operating point — in particular the high-confidence (HC) rule, cosine > 0.95 and dHash ≤ 5 — whose between-accountant coincidence behavior the calibration below makes explicit. Because the rule is calibrated on a large Chinese-signature corpus, the HC values double as a practical starting reference for practitioners working with comparable Chinese-signature image pipelines, rather than a setting to transplant unchanged.

-![Figure 3](figures/fig3.png)
+![](figures/fig3.png)

 *Figure 3. The two measures and the five regions, drawn as the real 2D density of all Big-4 signatures (n = 150,441; log color scale, integer dHash bins). The cosine axis is split at the low cut 0.8547 (the calibration-cell same-vs-different-accountant crossover) and the high cut 0.95; within the high-cosine band the dHash axis is split at 5 and 15. The mass concentrates in the bottom-right HC corner — high cosine with near-identical structure — and thins out as a single continuum toward lower cosine and higher dHash, with no gap separating a "reuse" cluster from a "hand-signed" one (Section V-A); note also that essentially all signatures sit above cosine ≈ 0.85, the compressed high-similarity range discussed in Section V-A.*

@@ -173,9 +177,9 @@ How often the strict rule fires by chance (pooled). In the Firms-B/C/D 2013–20

 | Group / rule | Per comparison | Per signature | Per report |
 |---|---|---|---|
-| HC rule — B/C/D 2013–2019 (calibration) | 1.0×10⁻⁵ [4e-6, 2.3e-5] | 0.59% [0.45%, 0.73%] | 1.2% |
-| HC rule — all four firms (contamination check) | 1.4×10⁻⁴ | 11.0% | 18.0% |
-| MC band (HC+MC) — B/C/D 2013–2019, per report | — | — | ≈17.5% |
+| HC — B/C/D 2013–2019 (calibration) | 1.0×10⁻⁵ [4×10⁻⁶, 2.3×10⁻⁵] | 0.59% [0.45%, 0.73%] | 1.2% |
+| HC — all four firms (contaminated) | 1.4×10⁻⁴ | 11.0% | 18.0% |
+| MC band (HC+MC) — B/C/D 2013–2019 | — | — | ≈17.5% |

 Each baseline firm on its own (B, C, D). Reported separately, the three baseline firms are alike and uniformly low. A logistic regression of the per-signature HC flag on firm (with Firm D as the reference) over the baseline cell puts Firms B and C within about 3.5× of each other (odds ratios 1.73 and 0.49), and none of them comes close to the high rates we see for Firm A in Section IV-C. The 2013–2019 five-way breakdown for each of Firms B/C/D (counts and within-firm percentages) is reported in Table II-b; the full-period (2013–2023) breakdown is in Table IV for reference.

@@ -217,9 +221,9 @@ Firm A's within-firm repeatability, against the other firms. On their own signat

 Four further checks confirm the contrast is not an artefact of how the comparison pools are built, of the imaging-pipeline trend, or of any single year. First, pool size. Stratifying accountants by how many signatures they contribute and comparing within each stratum, Firm A's HC rate exceeds the other firms' at every level — 66% versus 20% for the smallest pools (under 50 signatures), rising to 76–84% versus 21–29% for larger pools. Even Firm-A accountants with few signatures to match against fire the rule far more often than B/C/D accountants with the same pool size; pool size raises the rate within every firm (the log-pool-size odds ratio of 4.01), but the firm gap dwarfs it and survives at fixed pool size, which rules out the "more signatures, more chances for an extreme match" explanation. Second, dependence among an accountant's own signatures. Re-estimating the gap with the bootstrap resampled at the accountant level (179 Firm-A accountants, 280 at Firms B/C/D) rather than treating signatures as independent, the Firm-A-minus-B/C/D difference in HC rate is 53.7 percentage points with a 95% interval of [49.5, 57.5] — accountant-level clustering widens the intervals the per-signature Wilson bounds give, but leaves the contrast far too large to be explained away. Third, the time trend and pipeline shift (Section V-B). Adding year fixed effects to the logistic regression — so the firm effect is identified within year, net of the 2020–2021 imaging-pipeline transition — leaves Firms B/C/D at 0.06–0.12 times Firm A's odds of an HC flag (odds ratios 0.116, 0.061, 0.070), still an order of magnitude lower once the common time trend is absorbed. Fourth, single-year dependence. Leaving out each calendar year in turn and recomputing, the Firm-A-minus-B/C/D gap stays within 53.1–54.9 percentage points (full-sample 53.7), so neither the high-reuse digital-native years (2022–2023) nor any earlier year drives it.

-![Figure 4](figures/fig4.png)
+![](figures/fig4.png)

-*Figure 4 — Within-accountant similarities, Firm A vs Firms B/C/D: (a) cosine, (b) dHash. Firm A's mass sits near cosine = 1 and dHash = 0 (medians 0.986 / 2) against Firms B/C/D's 0.959 / 7; dashed lines mark the cuts (cosine 0.95; dHash 5), the dotted line the LH/UN crossover (0.8547). The held-out cross-firm HC rate (0.42%) sits at/below the clean reference ICCR (0.59%), while the within-Firm-A HC rate is 82% — the signal is inside the firm (annotation below panels).*
+*Figure 4. Within-accountant similarities, Firm A vs Firms B/C/D: (a) cosine, (b) dHash. Firm A's mass sits near cosine = 1 and dHash = 0 (medians 0.986 / 2) against Firms B/C/D's 0.959 / 7; dashed lines mark the cuts (cosine 0.95; dHash 5), the dotted line the LH/UN crossover (0.8547). The held-out cross-firm HC rate (0.42%) sits at/below the clean reference ICCR (0.59%), while the within-Firm-A HC rate is 82% — the signal is inside the firm (annotation below panels).*

 (2) Ranking accountants by similarity, in each period. Ranking every accountant in Firms A–D by a single within-accountant similarity score, separately for 2013–2019 and for 2020–2023, Firm A's accountants sit at the high-similarity (templated) end. A descriptive three-group summary of the two-measure space tells the same story: its high-cosine/low-dHash group holds 82.5% of Firm A's accountants and almost none of the others' (Table III). The period split confirms the expected pattern: Firm A's per-signature HC rate is at the top in both periods (80.3% in 2013–2019, 83.8% in 2020–2023), while Firms B/C/D move upward after 2020 as the formal systems came in — Firm B from 29.0% to 42.0%, Firm C from 21.6% to 26.7%, Firm D from 22.0% to 28.0% (see Section V-B).

@@ -232,9 +236,9 @@ Four further checks confirm the contrast is not an artefact of how the compariso
 | Firm C | 102 | 1.0% |
 | Firm D | 52 | 1.9% |

-![Figure 5](figures/fig5.png)
+![](figures/fig5.png)

-*Figure 5 — Per-accountant HC rate, ranked, one panel per period (2013–2019; 2020–2023), points colored by firm (accountants with ≥ 5 signatures in the period). Firm A (red) occupies the templated top of the ranking in both periods; Firms B/C/D rise after 2020 (HC rate B 29.0→42.0%, C 21.6→26.7%, D 22.0→28.0%; Firm A 80.3→83.8%), consistent with staggered formal-system adoption (Section V-B).*
+*Figure 5. Per-accountant HC rate, ranked, one panel per period (2013–2019; 2020–2023), points colored by firm (accountants with ≥ 5 signatures in the period). Firm A (red) occupies the templated top of the ranking in both periods; Firms B/C/D rise after 2020 (HC rate B 29.0→42.0%, C 21.6→26.7%, D 22.0→28.0%; Firm A 80.3→83.8%), consistent with staggered formal-system adoption (Section V-B).*

 (3) Applying the calibrated rule to Firm A, 2013–2023. Taking the operating point calibrated on Firms B/C/D in 2013–2019 and applying it across Firm A's full record, 81.70% of Firm A's signatures (82% rounded) land in HC (per signature; the full five-way breakdown is in Table IV). Read together with the interview fact that Firm A mainly uses overlay stamping, the system's firm-level output matches the practice the firm itself describes. We say this carefully: it is a match at the firm level, not a label on any single signature. We do not classify the individual signatures as non-hand-signed, because for any one signature the two measures cannot separate reuse from a shared scanning pipeline or a uniform house style (Section III-D).

@@ -300,7 +304,7 @@ How sensitive the operating point is. Right around the HC cutoff the per-signatu

 A single slope understates how the rule behaves, so we map the full surface rather than defend one cut. Figure 6 plots, over the entire (cosine cut × dHash cut) plane, the clean-group flag rate (panel a) and the Firm A − B/C/D flag-rate contrast (panel b), and neither view favours the chosen cut by construction. First, the surfaces are smooth: there is no cliff at (0.95, dHash ≤ 5), so the operating point is a readable choice on a continuous trade-off rather than a discovered boundary (Section V-A), and an operator who wants a tighter floor can move toward higher cosine and lower dHash and read the consequence off the surface. Second, the firm contrast is not an artefact of the threshold: it exceeds 45 percentage points across a broad region of low-dHash, high-cosine cuts and in fact grows as the cut tightens (for example 58 pp at cosine 0.97, dHash ≤ 3), so the deliberately looser HC point trades a few points of contrast for catching more reuse, not the reverse. The same surface makes the weakness of the cosine-only direction explicit: extending the structural cut to the MC bound (dHash ≤ 15) roughly halves the contrast (to about 27 pp) while sharply inflating the clean-group flag rate. That is precisely why the MC band is only advisory and the cosine-only HSC band carries no weight (Section III-D): the partition is not drawn to flatter the narrative, and the surface shows directly where each band earns its keep and where it does not.

-![Figure 6](figures/fig6.png)
+![](figures/fig6.png)

 *Figure 6. Sensitivity surface of the deployed rule over the two-measure threshold plane (Big-4, n = 150,441). (a) Clean-group (B/C/D) flag rate at each (cosine cut, dHash cut); the chosen HC operating point (star) sits in a low-rate, high-specificity region with no cliff. (b) Firm A minus B/C/D flag-rate contrast (percentage points); the contrast exceeds 45 pp across a broad low-dHash, high-cosine band and weakens toward the MC bound (dHash ≤ 15, dotted), so the operating point is not a cherry-picked threshold and the MC band is visibly the less discriminating region.* The MC/HSC boundary at dHash = 15 sits in a flat (saturating) region, where moving the line adds flagged cases without adding specificity; this is a further reason to treat the MC band as advisory (Section IV-B).

@@ -308,7 +312,17 @@ Leaving out one firm at a time. A two-group fit is unstable across firms — its

 Crossover scope. The low cosine cut is the same-vs-different-accountant cosine crossover; recomputing it across scopes moves it by at most 0.025 — 0.8547 on the calibration cell (the primary value; Section IV-A), 0.8367 corpus-wide, 0.8489 on the all-period baseline firms, 0.8302 with the non-Big-4 firms added — and because the cut affects only the UN/LH boundary, switching among these scopes changes no HC/MC/HSC result and shifts the UN/LH split by at most 0.4 percentage points per firm. We use the calibration-cell value as primary for held-out discipline and report the others as robustness.

-The same-pair variant. Recomputing the rule so that a single partner signature must satisfy both inequalities at once (the same-pair rule of Section III-D) leaves every conclusion unchanged. The within-firm concentration of cross-accountant matches is in fact higher under same-pair (97.0–99.96% across the four firms) than under the deployed any-pair rule (76.7–98.8%), so the headline structure does not depend on the any-pair construction — pushed to the stricter event, it gets stronger.
+The same-pair variant. A reader may worry that the deployed rule is a *derived* statistic rather than an observation: the cosine maximum and the dHash minimum are each taken over the accountant's pool and can originate from different partner signatures, so the high-confidence region might in principle be assembled from two unrelated extrema. We therefore recompute the rule under the strict *same-pair* construction, where a single partner signature must satisfy both inequalities at once (Section III-D), and report it in the main text rather than the supplement. Two views agree. First, the within-firm concentration of cross-accountant matches is higher under same-pair (97.0–99.96% across the four firms) than under the deployed any-pair rule (76.7–98.8%). Second, and more directly, the per-signature HC flag rate — the quantity the any-pair concern targets — behaves the same way (Table VI): requiring one partner to satisfy both inequalities lowers every firm's rate, as expected, but it *widens* the firm gap rather than narrowing it. Firm A still fires on a majority of its own signatures (57.3%) while the baseline firms fall to 5–9%, so the Firm-A-to-baseline ratio rises from about 2.4–3.4× under any-pair to about 6.4–10.8× under same-pair. The high-confidence region is therefore not an artefact of combining extrema from different partner signatures; pushed to the stricter event, the structure gets stronger.
+
+**Table VI — HC flag rate by firm under the deployed any-pair rule and the strict same-pair rule.**
+
+| Firm | Signatures | Any-pair HC | Same-pair HC |
+|---|---|---|---|
+| Firm A | 60,448 | 81.7% | 57.3% |
+| Firm B | 34,248 | 34.6% | 9.0% |
+| Firm C | 38,993 | 23.7% | 5.3% |
+| Firm D | 16,752 | 24.7% | 7.7% |
+| All Big-4 | 150,441 | 49.6% | 27.3% |

 Each gate adds specificity. On the all-four-firm pool the cosine gate alone fires per comparison at 6.0×10⁻⁴; adding the structural gate multiplies this by 0.234 (the conditional ICCR of dHash ≤ 5 given cos > 0.95), giving the joint 1.4×10⁻⁴. Each axis contributes specificity beyond the other — quantitative support for the two-gate design over either measure alone (Section I, Section III-D).

@@ -0,0 +1,28 @@
+"""Table VI: HC flag rate by firm under any-pair (deployed) vs strict same-pair rule.
+same-pair dHash = Hamming distance between a signature's dHash and its cosine-closest
+same-accountant partner (closest_match_file). Reproduces from signature_analysis.db."""
+import sqlite3
+from collections import defaultdict
+DB = "/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db"
+BIG4 = ('勤業眾信聯合', '資誠聯合', '安侯建業聯合', '安永聯合')
+FM = {'勤業眾信聯合': 'A', '安侯建業聯合': 'B', '資誠聯合': 'C', '安永聯合': 'D'}  # DB firm name -> paper label
+con = sqlite3.connect(DB); cur = con.cursor()
+cur.execute("SELECT image_filename, dhash_vector FROM signatures WHERE dhash_vector IS NOT NULL")
+dh = {fn: bytes(b) for fn, b in cur.fetchall()}  # stored dHash bytes keyed by image filename
+ham = lambda a, b: bin(int.from_bytes(a, 'big') ^ int.from_bytes(b, 'big')).count('1')  # Hamming distance in bits
+cur.execute(f"""SELECT excel_firm,max_similarity_to_same_accountant,min_dhash_independent,closest_match_file,image_filename
+ FROM signatures WHERE is_valid=1 AND max_similarity_to_same_accountant IS NOT NULL
+   AND min_dhash_independent IS NOT NULL AND excel_firm IN ({','.join('?'*4)})""", BIG4)
+st = defaultdict(lambda: [0, 0, 0])  # per firm: [signatures, any-pair HC flags, same-pair HC flags]
+for firm, cos, mindh, cmf, imf in cur.fetchall():
+    f = FM[firm]; st[f][0] += 1
+    st[f][1] += (cos > 0.95 and mindh <= 5)  # any-pair: the two extrema may come from different partners
+    sp = ham(dh[imf], dh[cmf]) if (cmf in dh and imf in dh) else 99  # 99 = hash missing, so never flagged
+    st[f][2] += (cos > 0.95 and sp <= 5)  # same-pair: both cuts checked on the cosine-closest partner
+con.close()
+print(f"{'firm':5}{'n':>8}{'any-pair%':>11}{'same-pair%':>12}")
+T = [0, 0, 0]  # Big-4 totals, same layout as st[f]
+for f in 'ABCD':
+    n, a, s = st[f]; T = [T[0] + n, T[1] + a, T[2] + s]
+    print(f"{f:5}{n:>8}{100*a/n:>10.1f}%{100*s/n:>11.1f}%")
+n, a, s = T; print(f"{'all':5}{n:>8}{100*a/n:>10.1f}%{100*s/n:>11.1f}%")
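
Reviewer note (not part of the commit): the difference between the two rules in samepair_hc.py is easiest to see on a toy case. The sketch below is illustrative only. The function names `hamming` and `hc_flags`, the 8-byte hashes, and the toy values are assumptions; computing the any-pair minimum directly over the pool is also an assumption standing in for the stored `min_dhash_independent` column. Only the cuts (cosine > 0.95, dHash <= 5) and the reading of same-pair as the distance to the cosine-closest partner are taken from the script.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("dHash vectors must have the same length")
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")


def hc_flags(cos_to_pool, dhash_self, dhash_pool, cos_cut=0.95, dhash_cut=5):
    """Return (any_pair, same_pair) HC flags for one signature.

    cos_to_pool[i] and dhash_pool[i] describe the i-th same-accountant partner
    (hypothetical in-memory layout, not the database schema).
    """
    dists = [hamming(dhash_self, d) for d in dhash_pool]
    # Any-pair: cosine maximum and dHash minimum may come from different partners.
    any_pair = max(cos_to_pool) > cos_cut and min(dists) <= dhash_cut
    # Same-pair, as in the script: both cuts are checked on the cosine-closest partner.
    best = max(range(len(cos_to_pool)), key=cos_to_pool.__getitem__)
    same_pair = cos_to_pool[best] > cos_cut and dists[best] <= dhash_cut
    return any_pair, same_pair


# Toy pool: partner 0 is close in cosine but far in dHash (8 bits);
# partner 1 is close in dHash (1 bit) but below the cosine cut.
me = bytes(8)
pool_hashes = [b"\xff" + bytes(7), b"\x01" + bytes(7)]
print(hc_flags([0.97, 0.90], me, pool_hashes))  # (True, False)
```

In this toy case the any-pair rule fires by combining extrema from two different partners, while the same-pair rule does not. Because the check looks only at the cosine-closest partner, it can only under-count relative to "some single partner satisfies both cuts", which would make the same-pair column of Table VI a conservative figure under that reading.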