Paper A v13 rev9: sensitivity surface + honesty fixes (GPT-5.5 hostile review)

Pre-emptively address the three residual points from a hostile GPT-5.5 reviewer pass that rev8 had not fully closed (the rest of that review matched the already-applied fusion revision):

- Sensitivity surface (Major 5): new Figure 6 maps the deployed rule over the full (cosine cut x dHash cut) plane - clean-group flag rate and the Firm A-minus-B/C/D contrast. Shows no cliff at (0.95, dHash<=5), contrast >45pp across a broad region (58pp at 0.97/dHash<=3), and that extending to the MC bound (dHash<=15) halves the contrast - so the thresholds are not cherry-picked and the weaker MC band is shown, not hidden. Reproducible via make_fig6_sensitivity.py (DB columns only).
- Soften "reuse-dominated" (Major 1): the assertion that Firm A "is" a reuse-dominated population now reads "behaves in the screen as," explicitly resting on interviews + byte-identity rather than per-signature ground truth; two other uses made conditional/generic.
- Shared-pipeline contamination of ICCR (Major 2): Sec III-E now names the shared within-firm imaging pipeline (scanners, PDF assembly, red-stamp removal) as a channel that can lift the inter-CPA rate above true chance, distinct from "shared template," supported by the Sec V-B pipeline audit; bias direction (higher floor) keeps the Firm-A contrast conservative.

rev9 docx rebuilt (6 figures embedded).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Qn59FdF9JMyfFg3sjcUNNG
2026-06-23 14:59:55 +08:00
parent da455791de
commit cb38d413ad
4 changed files with 72 additions and 5 deletions
@@ -133,7 +133,7 @@ With no labeled negatives to learn from, the calibration uses a stand-in: a grou

 Why not all four firms. As Section IV-C will show, almost all of one firm's between-accountant matches fall on other accountants of the same firm, and we have byte-level proof of image reuse across about fifty of that firm's partners. If we put Firm A into the reference group, we would be filling the "by chance" rate with exactly the within-firm matches the rule is supposed to catch — a circular calibration. So we use Firms B/C/D as the clean reference group and keep Firm A as a test case; we report the all-four-firm number only to show how much Firm A contaminates it.

-Why 2013–2019. We further limit the reference group to the years before formal firm-wide electronic-signing systems (adopted from 2020 onward; Section III-A). What this buys us is the absence of a shared template across accountants — not a guarantee that every signature was handwritten. The interviews say some baseline firms used informal individual stamping before 2020, but each accountant's stored image was their own, so different accountants' signatures still match only by chance; the chance rate is about matches between accountants, which individual stamping does not inflate. After 2020, formal systems standardize how reports are assembled, so that period is not a clean reference — and indeed the chance rate rises after 2020 (Section V-B). We therefore calibrate on the Firms-B/C/D 2013–2019 cell and score every held-out cell against it.
+Why 2013–2019. We further limit the reference group to the years before formal firm-wide electronic-signing systems (adopted from 2020 onward; Section III-A). What this buys us is the absence of a shared template across accountants — not a guarantee that every signature was handwritten. The interviews say some baseline firms used informal individual stamping before 2020, but each accountant's stored image was their own, so different accountants' signatures still match only by chance; the chance rate is about matches between accountants, which individual stamping does not inflate. One further channel deserves to be named, because it is not the template and we cannot fully exclude it: accountants at the same firm pass through a shared imaging pipeline — common scanners, PDF-assembly software, and the red-stamp-removal step (Section III-C, Section V-B) — and a shared pipeline can imprint correlated artefacts on otherwise-unrelated signatures, which would lift the inter-CPA rate above true chance. The pipeline audit of Section V-B confirms that such shared production paths exist and change over time. This is a reason to read the ICCR as a *specificity proxy* rather than a literal coincidence rate; its bias, like reference contamination, runs toward a higher floor, which makes the Firm-A contrast more conservative rather than less. After 2020, formal systems standardize how reports are assembled, so that period is not a clean reference — and indeed the chance rate rises after 2020 (Section V-B). We therefore calibrate on the Firms-B/C/D 2013–2019 cell and score every held-out cell against it.

 We report the rule's chance rate at three levels, because the rule takes the best match over a pool and so the per-signature rate is not the same as the per-pair rate: per comparison (sampled pairs of different accountants), per signature, and per report, each with a confidence interval. We call this the inter-CPA coincidence rate (ICCR) rather than a "false-acceptance rate," which we reserve for settings that have labeled negatives. The ICCR is a *between-accountant* coincidence rate: how often the rule fires on the signatures of two *different* accountants. It is therefore at best a proxy for specificity, and only under the stated assumption (no shared template across accountants). It is important to be exact about what it is not. The quantity the reuse question actually needs is the *within-accountant* false-positive rate — how often the rule would fire on a genuinely consistent hand-signer's own signatures — and that rate is not estimable here, because no accountant in the corpus is labeled as a known hand-signer. We considered benchmarking it against an external corpus of genuine repeated signatures (a public signature dataset supplies many authentic samples per writer), but such corpora are a different population and script acquired under a different pipeline, so the resulting rate would not transfer to this setting; importing it would reintroduce exactly the kind of unverifiable cross-distribution assumption our label-free calibration is built to avoid. We therefore report the limitation rather than a misleading proxy. The ICCR is not even a bound on it: a uniform individual hand keeps cosine high by design, so a true hand-signer's within-accountant fire rate can sit far *above* the between-accountant coincidence rate. 
Any statement that divides a firm's within-accountant fire rate by this between-accountant floor (an "X× the floor" comparison) therefore overstates the gap — the bias runs in the anti-conservative direction — and we do not report such ratios as effect sizes. Read as a between-accountant specificity proxy under the stated assumption, the ICCR is faithful to the evidence; read as a true error rate for the reuse question, it would claim more than we can show.

@@ -191,7 +191,7 @@ One point in Table II-b needs to be made explicit, because at first glance it lo

 ### B. From Categories to Actions: Review as Exception Management

-The proportions first, stated plainly. Before saying what to do with each category, we show how much of the data falls into each (Table IV, full corpus): HC 49.6%, MC 26.5%, UN 23.4%, HSC 0.2%, LH 0.3%. The ambiguous middle (MC + UN) is therefore not a fringe: about half of all four-firm signatures, and 65–76% at Firms B/C/D individually, against 18.1% at Firm A. Read against the institutional background (Section III-A), this is exactly the expected shape. Firm A is a reuse-dominated population, where the screen settles most signatures outright; Firms B/C/D in this period are mixed populations in which hand-signing and informal individual stamping coexist, so per-signature similarity is genuinely ambiguous there. The right response to a large middle is not to hide it but to give it a disposition path that does not require a per-signature verdict. Four moves do this. To be clear about what is established versus proposed here: the category proportions above, the per-band chance rates, and the byte-identity counts are empirical results, whereas the four moves are a *designed* operating procedure derived from the calibration — they are an argument that the workload is tractable, not a validated workflow. The protocol's end-to-end first run on a bounded, human-labeled sample, which is what would actually measure its discriminating behavior, is left to future work (Section V); we therefore present the moves as the intended use of the calibrated rule, not as evidence in their own right.
+The proportions first, stated plainly. Before saying what to do with each category, we show how much of the data falls into each (Table IV, full corpus): HC 49.6%, MC 26.5%, UN 23.4%, HSC 0.2%, LH 0.3%. The ambiguous middle (MC + UN) is therefore not a fringe: about half of all four-firm signatures, and 65–76% at Firms B/C/D individually, against 18.1% at Firm A. Read against the institutional background (Section III-A), this is exactly the expected shape. Firm A behaves in the screen as a reuse-dominated population — a reading consistent with the interviews and with the byte-identical evidence, though it rests on those rather than on per-signature ground truth — and the screen settles most of its signatures outright; Firms B/C/D in this period are mixed populations in which hand-signing and informal individual stamping coexist, so per-signature similarity is genuinely ambiguous there. The right response to a large middle is not to hide it but to give it a disposition path that does not require a per-signature verdict. Four moves do this. To be clear about what is established versus proposed here: the category proportions above, the per-band chance rates, and the byte-identity counts are empirical results, whereas the four moves are a *designed* operating procedure derived from the calibration — they are an argument that the workload is tractable, not a validated workflow. The protocol's end-to-end first run on a bounded, human-labeled sample, which is what would actually measure its discriminating behavior, is left to future work (Section V); we therefore present the moves as the intended use of the calibrated rule, not as evidence in their own right.

 Move 1 — calibrate each band's evidential weight, and demote what fails. The calibration tells us what each flag is worth. The HC band fires by chance on only about 1.2% of reports in the clean reference group, so an HC flag is close to self-certifying: it needs essentially no verification effort, and it goes straight onto the action list — findings to count, report, or investigate — rather than onto a list of flags still to be checked. The MC band fires by chance on about 17.5% of reports in the clean reference group — roughly one clean-group report in six — and, unlike HC, this rate does not drop when Firm A's accountants are excluded from the cross-accountant comparison pool (it edges up, because removing Firm A's distinctive template leaves a pool whose members resemble one another a little more at the coarse dHash ≤ 15 scale); the boundary at dHash = 15 also sits in a flat region of the sensitivity sweep, adding flagged cases without adding specificity (Section V-C). An MC flag on its own therefore carries almost no information and does not justify verification effort; it matters only in combination with other evidence. The UN band is ambiguous in the same spirit and is treated alongside MC; on the clean baseline the UN cosine band is reached by chance about 88% of the time per signature (98.2% per report), confirming that a UN flag is essentially uninformative about reuse on its own, whereas the HSC band is reached by chance only about 0.13% of the time per signature (0.25% per report) and in any case points away from reuse (style match without structural support). The HSC band is tiny (0.2%), so it warrants only a light spot-check. The LH band needs no action. 
Demotion, however, only says what an MC or UN flag is not — standalone evidence; what becomes of these signatures is the business of the next three moves: their information flows into the accountant-level scores (Move 2), which byte-identity hits then sharpen by proving that an accountant's stored image is in circulation (Move 3); the residual's data needs are named rather than guessed at (Move 4); and where a human does look at individual cases, the bounded protocol specified below applies.

@@ -203,7 +203,7 @@ Move 4 — state what the residual needs, instead of classifying it anyway. Afte

 Where a human does look, the review follows a defined and bounded protocol. We specify the protocol here as a design deliverable of the method: the discriminating behaviors stated below are design expectations, following from the artifact properties of reused versus independently signed images, and the protocol's first execution, on a bounded sample, is listed as future work (Section VI). (1) Side-by-side overlay inspection: the reviewer is shown the flagged signature next to the same-accountant signature(s) that produced its score, with a pixel-difference overlay and an edge-aligned superposition; a reused image is expected to overlay almost exactly, whereas two independent signings show natural variation in pressure, ink, and baseline. (2) Secondary artifact checks not used by the rule — exact registration, JPEG and scan-noise fingerprints (the compression and anti-aliasing traces a reused raster carries with it), and scaling traces — are designed to separate a reused raster from a re-scanned genuine signature at low cost. (3) Document and time context: the reviewer checks whether the matched signatures come from reports of different dates or engagements (reuse across time is more telling than within a single filing) and whether the surrounding layout shows a standard template or stamp. (4) Bounded per-accountant sampling: because the operational question is usually at the accountant or firm level, the reviewer judges a bounded random sample per accountant rather than every flagged signature, keeping the effort proportional to the number of accountants, not the number of signatures. (5) Feedback into calibration: each adjudicated case yields a label — reuse, hand-signed, or undetermined — and these accumulate into the small ground-truth set the setting otherwise lacks, which can later tighten the operating point or support supervised validation. 
The protocol's relation to Move 4 is one of scale: steps 1–3 apply per-case versions of the same artifact evidence that Move 4 would collect corpus-wide, step 4 bounds how many cases a human ever sees, and step 5 accumulates the labeled set Move 4 asks for. What the protocol cannot do — and is not claimed to do — is resolve the residual at scale; that is exactly what the corpus-wide metadata collection of Move 4 would add.

-Why this is exception management rather than caseload. In a reuse-dominated population the high-confidence tier settles most signatures directly (82% at Firm A), and the four moves reduce the remaining 18% to per-accountant judgments and a review queue bounded by the number of accountants rather than the number of signatures — exception management at the signature level. In a mixed population the same machinery delivers the same promise one level up, at the accountant. Move 1 removes the bulk of the apparent caseload outright: an MC flag alone does not justify verification, which at Firms B/C/D takes 29–41% of signatures off the worklist before anyone is assigned to anything (the UN band carries no flag in the first place, so the demotion bites on MC alone). Move 2 positions every accountant on the replication-dominance spectrum, so attention concentrates on the few high-ranked or mixed cases rather than on tens of thousands of signatures. Move 3 supplies proof where proof exists: 117 of the 262 byte-identical signatures sit at Firms B/C/D, demonstrating that stored-image reuse is a real practice at the mixed firms too, and anchoring the accountant-level judgments there. And the staggered post-2020 adoption of formal systems gives the mixed firms a readable time axis (Section V-B). What is not delivered in a mixed population is a per-signature verdict for the ambiguous middle — a limit of identification, not of workload. Exception management therefore holds in both settings; what changes is only the level at which exceptions are defined — the signature where reuse dominates, the accountant where practices are mixed. Because the cutoffs are tunable, a reviewer who wants higher specificity can tighten them (for example, cosine > 0.98 and dHash ≤ 3), trading a lighter caseload against the risk of missing noisier reused signatures — a trade-off we cannot tune against recall, since recall is unobservable here.
+Why this is exception management rather than caseload. Where a firm's output is dominated by reuse, the high-confidence tier settles most signatures directly (82% at Firm A), and the four moves reduce the remaining 18% to per-accountant judgments and a review queue bounded by the number of accountants rather than the number of signatures — exception management at the signature level. In a mixed population the same machinery delivers the same promise one level up, at the accountant. Move 1 removes the bulk of the apparent caseload outright: an MC flag alone does not justify verification, which at Firms B/C/D takes 29–41% of signatures off the worklist before anyone is assigned to anything (the UN band carries no flag in the first place, so the demotion bites on MC alone). Move 2 positions every accountant on the replication-dominance spectrum, so attention concentrates on the few high-ranked or mixed cases rather than on tens of thousands of signatures. Move 3 supplies proof where proof exists: 117 of the 262 byte-identical signatures sit at Firms B/C/D, demonstrating that stored-image reuse is a real practice at the mixed firms too, and anchoring the accountant-level judgments there. And the staggered post-2020 adoption of formal systems gives the mixed firms a readable time axis (Section V-B). What is not delivered in a mixed population is a per-signature verdict for the ambiguous middle — a limit of identification, not of workload. Exception management therefore holds in both settings; what changes is only the level at which exceptions are defined — the signature where reuse dominates, the accountant where practices are mixed. Because the cutoffs are tunable, a reviewer who wants higher specificity can tighten them (for example, cosine > 0.98 and dHash ≤ 3), trading a lighter caseload against the risk of missing noisier reused signatures — a trade-off we cannot tune against recall, since recall is unobservable here.

 ### C. Held-Out Benchmark: Firm A (a Known Positive)

@@ -248,7 +248,7 @@ Four further checks confirm the contrast is not an artefact of how the compariso
 | Firm D | 24.51% | 29.33% | 0.22% | 45.28% | 0.66% | 17,133 |
 | Overall | 49.58% | 26.47% | 0.21% | 23.42% | 0.32% | 150,442 |

-Reading the five-way mix across firms. Table IV is also the quantitative basis for the positioning in Section IV-B. At Firm A the ambiguous middle (MC + UN) is 18.1% — the screen reads a reuse-dominated population almost cleanly, with four signatures in five settled outright. At Firms B/C/D the middle is 65–76% — the signature of a mixed population in which hand-signing and informal stamping coexist (Section III-A), where per-signature similarity is genuinely ambiguous. There the screen's deliverables move up one level (Section IV-B): the MC share (29–41% of these firms' signatures, against the 26.5% corpus-wide MC share) is demoted off the worklist, the accountant-level scores rank these firms' accountants alongside everyone else's, and the byte-identical signatures at these firms (117 of the 262) are threshold-free proof that reuse occurs there too. The per-signature mix stays ambiguous; the disposition does not.
+Reading the five-way mix across firms. Table IV is also the quantitative basis for the positioning in Section IV-B. At Firm A the ambiguous middle (MC + UN) is 18.1% — the screen reads this population almost cleanly, with four signatures in five settled outright. At Firms B/C/D the middle is 65–76% — the signature of a mixed population in which hand-signing and informal stamping coexist (Section III-A), where per-signature similarity is genuinely ambiguous. There the screen's deliverables move up one level (Section IV-B): the MC share (29–41% of these firms' signatures, against the 26.5% corpus-wide MC share) is demoted off the worklist, the accountant-level scores rank these firms' accountants alongside everyone else's, and the byte-identical signatures at these firms (117 of the 262) are threshold-free proof that reuse occurs there too. The per-signature mix stays ambiguous; the disposition does not.

 (+) Byte-identical signatures: direct evidence of reuse. Beyond the screening numbers, 262 signatures across the four firms are byte-for-byte identical to another signature — 145 of them at Firm A, spread across about fifty partners. Identical files cannot come from independent hand-signing, so their existence is direct, hard evidence that image reuse happens and that it concentrates at Firm A. These pairs are not a bookkeeping artefact: every one of the 262 matches a signature in a *different* report PDF (none is the same file double-counted), and 170 of the 262 fall in different filing months, so duplicate filings or corrected re-submissions of one report cannot explain them. One caveat belongs with this count, developed in Section V-B: most of the 262 (232) occur in the post-2020 digital-native era, where exact reuse is both easier and perfectly preserved, so the raw count is not a clean prevalence trend; the pipeline-independent core is the 30 in the pre-2021 pure-scan era (18 at Firm A), which scanning noise alone cannot produce. Because a byte-identical pair has cosine = 1 and dHash = 0, it lands in HC by definition; the rule's "100% capture" of this set is therefore tautological, and we do not read it as a sanity check or a lower bound on recall. We use byte-identity only for what it can show directly — that reuse occurs and where it concentrates — as a prevalence signal, not a measure of detector performance.

@@ -296,7 +296,13 @@ The clean way to separate them would be an event study that aligns each firm to

 We summarize the robustness checks here; full detail is in the supplementary materials.

-How sensitive the operating point is. Right around the HC cutoff the per-signature firing rate changes quickly — its local slope is about 25× the median across a cosine sweep and about 3.8× across a dHash sweep — which confirms that the HC point is a chosen, specificity-anchored operating point rather than a natural gap. The MC/HSC boundary at dHash = 15 sits in a flat (saturating) region, where moving the line adds flagged cases without adding specificity; this is a further reason to treat the MC band as advisory (Section IV-B).
+How sensitive the operating point is. Right around the HC cutoff the per-signature firing rate changes quickly (its local slope is about 25× the median across a cosine sweep and about 3.8× across a dHash sweep), which confirms that the HC point is a chosen, specificity-anchored operating point rather than a natural gap. The MC/HSC boundary at dHash = 15 sits in a flat (saturating) region, where moving the line adds flagged cases without adding specificity; this is a further reason to treat the MC band as advisory (Section IV-B).
+
+A single slope understates how the rule behaves, so we map the full surface rather than defend one cut. Figure 6 plots, over the entire (cosine cut × dHash cut) plane, the clean-group flag rate (panel a) and the Firm A − B/C/D flag-rate contrast (panel b), and neither view favours the chosen cut by construction. First, the surfaces are smooth: there is no cliff at (0.95, dHash ≤ 5), so the operating point is a readable choice on a continuous trade-off rather than a discovered boundary (Section V-A), and an operator who wants a tighter floor can move toward higher cosine and lower dHash and read the consequence off the surface. Second, the firm contrast is not an artefact of the threshold: it exceeds 45 percentage points across a broad region of low-dHash, high-cosine cuts and in fact grows as the cut tightens (for example 58 pp at cosine 0.97, dHash ≤ 3), so the deliberately looser HC point trades a few points of contrast for catching more reuse, not the reverse. The same surface makes the weakness of the cosine-only direction explicit: extending the structural cut to the MC bound (dHash ≤ 15) roughly halves the contrast (to about 27 pp) while sharply inflating the clean-group flag rate. That is precisely why the MC band is only advisory and the cosine-only HSC band carries no weight (Section III-D): the partition is not drawn to flatter the narrative, and the surface shows directly where each band earns its keep and where it does not.
+
+![Figure 6](figures/fig6.png)
+
+*Figure 6. Sensitivity surface of the deployed rule over the two-measure threshold plane (Big-4, n = 150,441). (a) Clean-group (B/C/D) flag rate at each (cosine cut, dHash cut); the chosen HC operating point (star) sits in a low-rate, high-specificity region with no cliff. (b) Firm A minus B/C/D flag-rate contrast (percentage points); the contrast exceeds 45 pp across a broad low-dHash, high-cosine band and weakens toward the MC bound (dHash ≤ 15, dotted), so the operating point is not a cherry-picked threshold and the MC band is visibly the less discriminating region.*

 Leaving out one firm at a time. A two-group fit is unstable across firms — its boundary is basically a "Firm A versus the rest" divider — while a three-group fit keeps a stable shape (its low-cosine/high-dHash group drifts by at most 0.005 in cosine) but a membership that shifts with the mix of firms (by up to 12.8 percentage points). So we use the groups only as descriptions, never as operational labels.

@@ -0,0 +1,61 @@
+"""Figure 6: two-measure sensitivity surface over the (cosine cut x dHash cut) plane.
+Panel A: clean-group (B/C/D) flag rate -- how permissive the operating point is.
+Panel B: Firm A minus B/C/D flag-rate contrast (pp) -- discrimination across the plane.
+Shows the chosen HC point (0.95, dHash<=5) is not a cherry-picked threshold and exposes
+the weaker MC band (dHash<=15). Reproduces from signature_analysis.db (DB columns only).
+"""
+import matplotlib
+matplotlib.use('Agg')
+import matplotlib.pyplot as plt
+import numpy as np
+import sqlite3
+
+DB = "/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db"
+BIG4 = ('勤業眾信聯合', '資誠聯合', '安侯建業聯合', '安永聯合')
+
+con = sqlite3.connect(DB); cur = con.cursor()
+cur.execute(f"""SELECT CASE WHEN excel_firm='勤業眾信聯合' THEN 1 ELSE 0 END isA,
+  max_similarity_to_same_accountant c, min_dhash_independent d
+FROM signatures WHERE is_valid=1 AND excel_firm IN ({','.join('?'*4)})
+  AND max_similarity_to_same_accountant IS NOT NULL AND min_dhash_independent IS NOT NULL""", BIG4)
+rows = cur.fetchall(); con.close()
+isA = np.array([r[0] for r in rows], bool)
+c = np.array([r[1] for r in rows]); d = np.array([r[2] for r in rows])
+cA, dA = c[isA], d[isA]; cB, dB = c[~isA], d[~isA]
+
+cos_cuts = np.arange(0.85, 0.9901, 0.0025)
+dh_cuts = np.arange(0, 21, 1)
+A = np.zeros((len(dh_cuts), len(cos_cuts)))
+B = np.zeros_like(A)
+for j, cc in enumerate(cos_cuts):
+    for i, dd in enumerate(dh_cuts):
+        A[i, j] = 100 * np.mean((cA > cc) & (dA <= dd))
+        B[i, j] = 100 * np.mean((cB > cc) & (dB <= dd))
+contrast = A - B
+
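+extent = [cos_cuts[0] - 0.00125, cos_cuts[-1] + 0.00125, dh_cuts[0] - 0.5, dh_cuts[-1] + 0.5]  # pad by half a cell so each pixel is centred on its (cosine, dHash) cut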
+fig, axes = plt.subplots(1, 2, figsize=(10.5, 4.3))
+
+for ax, Z, title, cmap, lab in [
+    (axes[0], B, '(a) Clean group (B/C/D) flag rate', 'viridis', 'flag rate (%)'),
+    (axes[1], contrast, '(b) Firm A − B/C/D contrast', 'magma', 'contrast (pp)')]:
+    im = ax.imshow(Z, origin='lower', aspect='auto', extent=extent, cmap=cmap)
+    cb = fig.colorbar(im, ax=ax, pad=0.02); cb.set_label(lab, fontsize=8); cb.ax.tick_params(labelsize=7)
+    # operating points
+    ax.scatter([0.95], [5], marker='*', s=180, color='white', edgecolor='black', zorder=5,
+               label='HC operating point (0.95, dHash≤5)')
+    ax.axhline(15, color='white', ls=':', lw=1.0)
+    ax.text(0.853, 15.4, 'MC upper bound (dHash≤15)', color='white', fontsize=6.5, va='bottom')
+    ax.set_xlabel('cosine cut', fontsize=9)
+    ax.set_ylabel('dHash cut (≤)', fontsize=9)
+    ax.set_title(title, fontsize=9)
+    ax.tick_params(labelsize=7.5)
+    ax.legend(loc='lower left', fontsize=6.5, framealpha=0.85)
+
+fig.suptitle('Figure 6. Sensitivity surface of the deployed rule over the two-measure threshold plane (Big-4, n=%d).' % len(c),
+             fontsize=9, y=1.02)
+fig.tight_layout()
+out = '/Volumes/NV2/pdf_recognize/paper/v13_build/figures/fig6.png'
+fig.savefig(out, dpi=200, bbox_inches='tight')
+plt.close(fig)
+print(f"fig6 OK n={len(c)}; HC(0.95,5) contrast={contrast[5, np.argmin(abs(cos_cuts-0.95))]:.1f}pp; written {out}")
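
The grid loop in make_fig6_sensitivity.py can also be expressed with NumPy broadcasting, which makes the surface easy to check without the database. What follows is a minimal sketch on synthetic data, not the script used for the paper: the function name `flag_rate_surface` and the synthetic distributions are assumptions made for illustration. Only the rule (cosine > cut and dHash <= cut), the cut grids, and the (dHash row, cosine column) layout are taken from the script above.

```python
import numpy as np


def flag_rate_surface(cos, dhash, cos_cuts, dh_cuts):
    """Percent of signatures with cos > cosine cut and dhash <= dHash cut.

    Returns an array of shape (len(dh_cuts), len(cos_cuts)), i.e. rows index
    the dHash cut and columns index the cosine cut.
    """
    cos = np.asarray(cos, dtype=float)
    dhash = np.asarray(dhash)
    cos_cuts = np.asarray(cos_cuts, dtype=float)
    dh_cuts = np.asarray(dh_cuts)
    passes_cos = cos[None, :] > cos_cuts[:, None]    # (n_cos_cuts, n_signatures)
    passes_dh = dhash[None, :] <= dh_cuts[:, None]   # (n_dh_cuts, n_signatures)
    # counts[i, j] = number of signatures passing both dh_cuts[i] and cos_cuts[j]
    counts = passes_dh.astype(np.int64) @ passes_cos.astype(np.int64).T
    return 100.0 * counts / cos.size


# Same cut grids as the script above.
cos_cuts = np.arange(0.85, 0.9901, 0.0025)
dh_cuts = np.arange(0, 21, 1)

# Synthetic stand-ins (assumption): group "a" is tighter than group "b".
rng = np.random.default_rng(0)
cos_a, dh_a = rng.uniform(0.90, 1.00, 2000), rng.integers(0, 10, 2000)
cos_b, dh_b = rng.uniform(0.80, 1.00, 2000), rng.integers(0, 25, 2000)

surface_a = flag_rate_surface(cos_a, dh_a, cos_cuts, dh_cuts)
surface_b = flag_rate_surface(cos_b, dh_b, cos_cuts, dh_cuts)
contrast = surface_a - surface_b  # percentage points, same layout as the surfaces
```

Because the rule is a conjunction of two one-sided cuts, each surface can only fall as the cosine cut rises and only rise as the dHash cut loosens, which gives a cheap consistency check on any real run of the script.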