Phase 6 round-7 codex 3-axis review fixes: 11 MAJOR + 5 MINOR

Codex GPT-5.5 3-axis peer review (paper/codex_review_gpt55_v4_round_3axis.md)
identified 11 MAJOR + 5 MINOR + 0 BLOCKER on three axes: (1) abstract/body
tone consistency, (2) methodology clarity / v3 residue, (3) no implicit
within-CPA or cross-year signature-consistency assumptions. 13 patches
applied across 4 source files; mirrored in paper_a_v4_combined.md.

Axis 1 (tone consistency between abstract and body):
- S I L33: "resolves the ambiguity" -> "provides complementary evidence
  for screening cases where ... hypotheses diverge"
- S I L35: "disproves the distributional-threshold path" -> "does not
  support the distributional-threshold path"
- S I L37 / S V-F L29: "characterise the deployed five-way classifier
  at three units" -> "characterise the deployed HC sub-rule and
  document-level HC+MC alarm derived from the five-way classifier at
  three units" (consistent with S V-H which says only HC sub-rule and
  HC+MC alarm are re-characterised by the present ICCR battery)
- S I L39 / S V-C / S III-L.4: "consistent with firm-specific template,
  stamp, or document-production reuse mechanisms" -> "consistent with --
  but does not independently establish -- firm-level template-like
  reuse, digitisation-pipeline homogeneity, or signing-style
  homogeneity, which descriptor-only data cannot separate (S V-H)"
  (mirrors abstract)

Axis 2 (methodology clarity / v3 residue):
- S III-G: added unit-bridge sentence distinguishing "descriptor-summary
  units" (signature/accountant) from "operational reporting units"
  (per-comparison/per-signature/per-document, S III-L)
- S III-H.2: "The calibration distinguishes two reference populations"
  -> "The supporting diagnostics use two reference populations" with
  explicit "neither is the calibration anchor"
- S III-L.1: "specificity" -> "ICCR refinement"
- S III-L.2: added "descriptive intuition, not an independence
  assumption used for estimation" caveat after the 1-(1-p)^n form

Axis 3 (no implicit signature-consistency assumptions):
- S III-F: hand-signing motivation rewritten as working hypothesis that
  "the classifier does not require ... to hold for all CPAs"
- S III-G A1: added "A1 does not assume temporal stability of
  handwriting or scanning workflow within or across years"
- S III-H.1: added label-caveat paragraph (operational rule outputs,
  not validated ground-truth classes); HC "strong replication evidence"
  -> "image-similarity evidence consistent with replication"; HSC
  "consistent with a CPA who signs very consistently" -> "mechanism not
  resolved by descriptor data alone"; LH explicitly owns that
  cross-year handwriting drift, scanner workflow change, or template
  variant rotation can also yield low max-cosine within a same-CPA pool
- S III-L.6 / S IV-M.6: "same-CPA repeatability signal" -> "observed
  same-CPA-pool excess ... not attributed to within-CPA handwriting
  repeatability"

Deferred (structural, not single-sentence patch): codex S III-I.2 /
S III-J K=2/K=3 deduplication; codex S III-K LOOO / S III-J duplication.
Both are MINOR stylistic redundancies, not reviewer-rejection risks.

DOCX rebuilt via export_v3.py; v4.0_20260515 file refreshed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-16 03:11:53 +08:00
parent 3672c9343e
commit becce857e1
8 changed files with 154 additions and 40 deletions
Binary file not shown.
Binary file not shown.
+114
View File
@@ -0,0 +1,114 @@
[軸 1]
[MAJOR] §I Contributions, L42
原句:「resolves the ambiguity between *style consistency* and *image reproduction*
問題:這比摘要語氣強。descriptor-only framework 不能真正「解開」style consistency 與 image reproduction 的機制歸因,§V-H 也說不能分離。
修改建議:改為「provides complementary evidence for screening cases where style consistency and image reproduction hypotheses diverge」。
[MAJOR] §III-H.1, L314
原句:「High-confidence non-hand-signed (HC)」
問題:作為 rule label 可接受,但正文與表格反覆使用時,容易讀成已驗證分類結果,而非 screening label。
修改建議:改為「High-confidence image-reuse screening label (HC)」,並在首次定義處明說「label names are operational labels, not ground-truth classes」。
[MAJOR] §III-H.1, L314
原句:「Both descriptors converge on strong replication evidence.」
問題:「strong replication evidence」過強;目前只保證兩個影像相似 descriptor 同時落入 rule box,不能保證 replication mechanism。
修改建議:改為「Both descriptors converge on image-similarity evidence consistent with replication」。
[MAJOR] §III-H.1, L318
原句:「Likely hand-signed (LH): Cosine $\leq 0.837$.」
問題:沒有 hand-signed ground truth,不能把 low-similarity screening bin 命名成「likely hand-signed」而不冒認 ground-truth status。
修改建議:改為「Low-replication-similarity (LRS)」或「Low-alert similarity」,保留舊縮寫可在括號說明。
[MAJOR] §V-C, L1060
原句:「similar, milder production-related reuse patterns at Firms B/C/D」
問題:這裡把 Firms B/C/D 的較溫和 within-firm collision 解讀為 production-related reuse,和 §V-H 的三機制不可分離聲明不一致。
修改建議:改為「similar, milder within-firm collision patterns, whose mechanisms may include template reuse, digitisation-pipeline homogeneity, or signing-style homogeneity」。
[MINOR] §V-F, L1074
原句:「the deployed five-way classifier is characterised at three units」
問題:§V-H L1100 說 MC/HSC 與 document worst-case rule 未被本診斷組重新 characterise;這句像是整個 five-way classifier 都完成 ICCR calibration。
修改建議:改為「the HC sub-rule and document-level alarm definitions derived from the five-way output are characterised...」。
[MINOR] §I Contributions, L44
原句:「Composition decomposition disproves the distributional-threshold path.」
問題:「disproves」語氣過硬;目前是對本資料與本診斷下不支持 natural-threshold reading。
修改建議:改為「does not support」或「rules out within the tested diagnostics」。
[軸 2]
[MAJOR] §III-G vs §III-L, L286 / L458
原句:「We analyse signatures at two units of resolution.」
問題:§III-G 說兩個 unitssignature/accountant),§III-L 又說 calibration 有三個 unitsper-comparison/per-signature/per-document)。讀者第一次讀會混淆「statistical summary unit」與「calibration/reporting unit」。
修改建議:在 §III-G 結尾加一個 bridgeaccountant/signature 是 descriptor-summary units;§III-L 的 three units 是 ICCR reporting units。
[MAJOR] §III-H.2, L326
原句:「The calibration distinguishes two reference populations」
問題:Firm A 後文反覆說不是 calibration anchor;這句仍像 v3 殘留,讓 Firm A 看起來參與 threshold calibration。
修改建議:改為「The supporting diagnostics use two reference populations」。
[MAJOR] §III-H.1 / §III-L, L320 / L456
原句:「retain their prior calibration provenance」
問題:§III-L 說本分析不 re-derive thresholds,但標題仍叫 threshold calibration,且 §III-H.1 只在 L320 一句帶過。第一次閱讀時不夠清楚:deployed 5-way rule 是既有 rule,ICCR 是行為刻畫,不是重新最佳化。
修改建議:在 §III-H.1 後加一小段明確列出:「rule definition」「what §III-L calibrates」「what remains from supplement」。
[MAJOR] §III-I.2 / §III-J, L342 / L369
原句:「K=2 / K=3 Gaussian mixture fits」
問題:K=2/K=3 數字、BIC、解讀在 §III-I.2 與 §III-J 重複,仍有 v3 splice 的疊床架屋感。
修改建議:§III-I.2 只保留「mixture path checked and demoted」摘要,完整模型細節集中到 §III-J。
[MINOR] §III-K, L432
原句:「Leave-one-firm-out reproducibility ... Discussed in §III-J above.」
問題:LOOO 已在 §III-J 詳述,又在 §III-K 作為 internal-consistency check;分類上不自然,且增加重複。
修改建議:把 §III-K.3 改成單句 cross-reference,或移回 §III-J。
[MINOR] §III-L.1, L489
原句:「dHash provides $\sim 4.3\times$ further per-comparison specificity」
問題:這裡漏了 proxy/disclaimer;全文已避免 FAR,但「specificity」單獨出現會弱化 ICCR 語氣。
修改建議:改為「specificity-proxy refinement」或「ICCR refinement」。
[MINOR] §III-L.2, L513
原句:「consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form」
問題:這是有用直覺,但 independence limit 與 within-firm violation 的關係應在同段提醒,否則會像正式模型。
修改建議:補一句「This is an intuition, not an independence assumption used for estimation」。
[軸 3]
[MAJOR] §III-F, L277
原句:「Hand-signing, by contrast, often yields high dHash similarity」
問題:這句預設同一 CPA 多次親簽時「overall layout typically preserved」,接近不應預設的個別 CPA 跨文件一致性。雖然用 often,但仍在方法動機處承擔了未驗證手寫行為。
修改建議:改為「One working hypothesis is that some hand-signed repetitions may preserve coarse layout while varying in fine execution; the classifier does not require this to hold for all CPAs」。
[MAJOR] §III-H.1, L316
原句:「consistent with a CPA who signs very consistently」
問題:HSC 被解讀成「同一 CPA 簽名很一致但非 reproduction」,這直接把高 cosine / 高 dHash 的 same-CPA pattern 歸因到個人書寫一致性。
修改建議:改為「high feature similarity without structural corroboration; mechanism unresolved」。
[MAJOR] §III-H.1, L318
原句:「Likely hand-signed」
問題:低 max-cosine 並不等於親簽;也可能是跨年度書寫變化、掃描/PDF pipeline、裁切或多 template variant。這是對「沒有高 same-CPA match」的過度解讀。
修改建議:改成 descriptor-based label,例如「low-replication-similarity」。
[MINOR] §III-G A1, L292
原句:「within the cross-year same-CPA pool」
問題:A1 本身不是年度一致性假設,但「cross-year」容易被讀成跨年度簽名應可比或應一致。
修改建議:改為「within the observed same-CPA candidate pool pooled over years; this does not assume temporal stability of handwriting or scanning」。
[MINOR] §III-L.6, L587
原句:「same-CPA repeatability signal」
問題:已加 caveat,但「repeatability」仍可能被讀成個人簽名一致性訊號。
修改建議:改為「observed same-CPA-pool excess signal, whose sources are not identifiable」。
[MINOR] §IV-M.6, L1043
原句:「interpreted as a same-CPA repeatability signal」
問題:同上,且出現在 results consolidation,容易被當成結果主張。
修改建議:改為「reported as same-CPA-pool excess under §III-M caveats, not attributed to handwriting repeatability」。
總體判讀
軸 1 verdict:大方向已和摘要一致,但仍有幾個「validated detector / mechanism attribution」味道偏重的句子,尤其是「resolves ambiguity」、HC/LH label、§V-C 對 B/C/D 的 production-related reuse。
軸 2 verdictv3 殘留大多已被 demote,但 §III 的敘事仍偏重複;最大問題是 unit taxonomy 與 calibration/re-characterisation 範圍需要更早講清。
軸 3 verdict:沒有發現核心計算邏輯必然依賴「同一 CPA 或跨年度簽名必須一致」;但若干命名與動機句會讓讀者以為有這個假設。
是否可送 partner 最終審查:可,但建議先做一輪小修,主要是改 label/語氣與 §III roadmap。
BLOCKER:無。
+2 -2
View File
@@ -12,7 +12,7 @@ The Big-4 accountant-level descriptor distribution rejects unimodality on both m
Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 hard-posterior assignment (now interpreted as a firm-compositional position assignment; §III-J), Firm A accounts for $0\%$ of C1 (low-cos / high-dHash position) and $82.5\%$ of C3 (high-cos / low-dHash position); the opposite pattern holds at Firm C, which has the highest C1 concentration at $23.5\%$. Firm A also accounts for 145 of the 262 byte-identical signatures in the Big-4 byte-identical anchor of §IV-H (with Firm B 8, Firm C 107, Firm D 2). Byte-level decomposition of the 145 Firm A pixel-identical signatures (see supplementary materials) shows they span 50 distinct Firm A partners (of 180 registered), with 35 byte-identical matches occurring across different fiscal years.
We treat Firm A as a *templated-end case study within the Big-4 sub-corpus* rather than as the calibration anchor for the operational threshold. Firm A enters the Big-4 anchor-based ICCR calibration on equal footing with the other three Big-4 firms (§III-L). The cross-firm hit matrix of §III-L.4 strengthens this framing: under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$$99.96\%$ within-firm across all four firms). Firm A's per-document D2 inter-CPA proxy ICCR of $0.6201$ (versus Firms B/C/D's $0.09$$0.16$) — the counterfactual rate at which Firm A documents would fire HC$+$MC if same-CPA pools were replaced by random inter-CPA candidates — reflects high inter-CPA collision concentration under the deployed rule, consistent with firm-specific template, stamp, or document-production reuse. (The corresponding observed rate on real same-CPA pools, from Table XVI, is substantially higher: $97.5\%$ HC$+$MC for Firm A; the proxy and observed rates measure different quantities and are not directly comparable.) The inter-CPA-anchor analysis alone is not diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence of image-level reuse among Firm A signatures; the distribution across many partners is consistent with a firm-level template or production workflow, and the within-firm collision pattern at all four Big-4 firms is consistent with similar, milder production-related reuse patterns at Firms B/C/D.
We treat Firm A as a *templated-end case study within the Big-4 sub-corpus* rather than as the calibration anchor for the operational threshold. Firm A enters the Big-4 anchor-based ICCR calibration on equal footing with the other three Big-4 firms (§III-L). The cross-firm hit matrix of §III-L.4 strengthens this framing: under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$$99.96\%$ within-firm across all four firms). Firm A's per-document D2 inter-CPA proxy ICCR of $0.6201$ (versus Firms B/C/D's $0.09$$0.16$) — the counterfactual rate at which Firm A documents would fire HC$+$MC if same-CPA pools were replaced by random inter-CPA candidates — reflects high inter-CPA collision concentration under the deployed rule, consistent with firm-specific template, stamp, or document-production reuse. (The corresponding observed rate on real same-CPA pools, from Table XVI, is substantially higher: $97.5\%$ HC$+$MC for Firm A; the proxy and observed rates measure different quantities and are not directly comparable.) The inter-CPA-anchor analysis alone is not diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence of image-level reuse among Firm A signatures; the distribution across many partners is consistent with a firm-level template or production workflow, and the within-firm collision pattern at all four Big-4 firms is consistent with similar, milder within-firm collision patterns at Firms B/C/D, whose mechanisms may include template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity (§V-H).
## D. K=2 / K=3 as Descriptive Firm-Compositional Partitions
@@ -26,7 +26,7 @@ Three feature-derived scores agree on the per-CPA descriptor-position ranking at
## F. Anchor-Based Multi-Level Calibration
The operational specificity-proxy behaviour of the deployed five-way classifier is characterised at three units of analysis (§III-L), all against the same inter-CPA negative-anchor coincidence-rate proxy. The per-comparison ICCR is consistent with the corpus-wide rate reported in §IV-I (cos$>0.95 \to 0.00060$) and extends it to the structural dimension (dHash$\leq 5 \to 0.00129$; joint $\to 0.00014$). The pool-normalised per-signature ICCR captures the deployed rule's effective per-signature rate under inter-CPA candidate-pool replacement ($0.1102$ pooled Big-4 any-pair HC), exposing that the per-comparison rate is not the deployed-rule rate at the per-signature classifier level: the deployed classifier takes max-cosine and min-dHash over a same-CPA pool of size $n_{\text{pool}}$, so the inter-CPA-equivalent rate scales approximately as $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit. The per-document ICCR aggregates to operational alarm-rate units: HC alone $0.18$, the operational HC$+$MC alarm $0.34$.
The operational specificity-proxy behaviour of the deployed HC sub-rule and document-level HC$+$MC alarm derived from the five-way classifier is characterised at three units of analysis (§III-L), all against the same inter-CPA negative-anchor coincidence-rate proxy. The per-comparison ICCR is consistent with the corpus-wide rate reported in §IV-I (cos$>0.95 \to 0.00060$) and extends it to the structural dimension (dHash$\leq 5 \to 0.00129$; joint $\to 0.00014$). The pool-normalised per-signature ICCR captures the deployed rule's effective per-signature rate under inter-CPA candidate-pool replacement ($0.1102$ pooled Big-4 any-pair HC), exposing that the per-comparison rate is not the deployed-rule rate at the per-signature classifier level: the deployed classifier takes max-cosine and min-dHash over a same-CPA pool of size $n_{\text{pool}}$, so the inter-CPA-equivalent rate scales approximately as $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit. The per-document ICCR aggregates to operational alarm-rate units: HC alone $0.18$, the operational HC$+$MC alarm $0.34$.
Two additional findings refine the calibration story. First, the per-pair conditional ICCR for dHash$\leq 5$ given cos$>0.95$ is $0.234$ (Wilson 95% $[0.190, 0.285]$): given the cosine gate, the structural dimension provides further per-comparison specificity at $\sim 4.3\times$ refinement. Second, the alert-rate sensitivity analysis (§III-L.5) shows the deployed HC threshold is locally sensitive rather than plateau-stable (local gradient $\approx 25\times$ the median for cosine, $\approx 3.8\times$ for dHash); alternative operating points can be characterised by inverting the ICCR curves (e.g., a tighter rule cos$>0.95$ AND dHash$\leq 3$ on the same-pair joint corresponds to per-signature ICCR $\approx 0.045$). The MC/HSC sub-band boundary at dHash$=15$, by contrast, *is* plateau-like (local-to-median ratio $\approx 0.08$), consistent with high-dHash-tail saturation.
+4 -4
View File
@@ -30,13 +30,13 @@ The contributions of this paper are:
2. **End-to-end pipeline.** We present a pipeline that processes raw PDF audit reports through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor similarity computation, with automated inference and no manual intervention after initial training.
3. **Dual-descriptor similarity.** We demonstrate that combining deep-feature cosine similarity with independent-minimum dHash resolves the ambiguity between *style consistency* and *image reproduction*, and we support the backbone choice through a feature-backbone ablation.
3. **Dual-descriptor similarity.** We demonstrate that combining deep-feature cosine similarity with independent-minimum dHash provides complementary evidence for screening cases where *style consistency* and *image reproduction* hypotheses diverge, and we support the backbone choice through a feature-backbone ablation.
4. **Composition decomposition disproves the distributional-threshold path.** We show via a 2×2 factorial diagnostic (firm-mean centring × integer-tie jitter) that the apparent multimodality of the Big-4 accountant-level descriptor distribution is fully attributable to between-firm location shifts and integer mass-point artefacts. The descriptor distributions contain no within-population bimodal antimode; a distributional "natural threshold" reading of the operating points is not empirically supported.
4. **Composition decomposition does not support the distributional-threshold path.** We show via a 2×2 factorial diagnostic (firm-mean centring × integer-tie jitter) that the apparent multimodality of the Big-4 accountant-level descriptor distribution is fully attributable to between-firm location shifts and integer mass-point artefacts. The descriptor distributions contain no within-population bimodal antimode; a distributional "natural threshold" reading of the operating points is not empirically supported.
5. **Anchor-based multi-level inter-CPA coincidence-rate calibration.** We characterise the deployed five-way classifier at three units of analysis: per-comparison ICCR (cos$>0.95$: $0.0006$; dHash$\leq 5$: $0.0013$; joint: $0.00014$), pool-normalised per-signature ICCR ($0.11$ for the deployed any-pair high-confidence rule), and per-document ICCR ($0.34$ for the operational HC$+$MC alarm). We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for terminology that requires ground-truth negative labels, which the corpus does not provide.
5. **Anchor-based multi-level inter-CPA coincidence-rate calibration.** We characterise the deployed high-confidence (HC) sub-rule and document-level HC$+$MC alarm derived from the five-way classifier at three units of analysis: per-comparison ICCR (cos$>0.95$: $0.0006$; dHash$\leq 5$: $0.0013$; joint: $0.00014$), pool-normalised per-signature ICCR ($0.11$ for the deployed any-pair high-confidence rule), and per-document ICCR ($0.34$ for the operational HC$+$MC alarm). We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for terminology that requires ground-truth negative labels, which the corpus does not provide.
6. **Firm heterogeneity quantification and within-firm cross-CPA collision concentration.** Per-document D2 inter-CPA proxy ICCRs differ by an order of magnitude across firms (Firm A: $0.62$ versus Firms B/C/D: $0.09$$0.16$); a per-signature logistic regression of the any-pair HC hit indicator on firm dummies and centred log pool size confirms the firm gap persists after pool-size control. Cross-firm hit matrix analysis shows within-firm collision concentrations of $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D under the deployed any-pair rule (the stricter same-pair joint event saturates at $97.0$$99.96\%$ within-firm across all four firms); the pattern is consistent with firm-specific template, stamp, or document-production reuse mechanisms — a descriptive finding about deployed-rule behaviour, not a claim of deliberate template sharing.
6. **Firm heterogeneity quantification and within-firm cross-CPA collision concentration.** Per-document D2 inter-CPA proxy ICCRs differ by an order of magnitude across firms (Firm A: $0.62$ versus Firms B/C/D: $0.09$$0.16$); a per-signature logistic regression of the any-pair HC hit indicator on firm dummies and centred log pool size confirms the firm gap persists after pool-size control. Cross-firm hit matrix analysis shows within-firm collision concentrations of $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D under the deployed any-pair rule (the stricter same-pair joint event saturates at $97.0$$99.96\%$ within-firm across all four firms); the pattern is consistent with — but does not independently establish — firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H).
7. **K=3 as descriptive firm-compositional partition; three-score convergent internal consistency.** We fit a K=3 Gaussian mixture as a descriptive partition of the Big-4 accountant-level distribution (interpreted as firm-compositional structure, not as three mechanism clusters). Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$; we report this as internal consistency rather than external validation, given that the scores share the underlying descriptor pair.
+13 -13
View File
@@ -106,7 +106,7 @@ Unlike DCT-based perceptual hashes, dHash is computationally lightweight and par
These descriptors provide partially independent evidence.
Cosine similarity is sensitive to the full feature distribution and reflects fine-grained execution variation; dHash captures only coarse perceptual structure and is robust to scanner-induced noise.
Non-hand-signing is expected to yield extreme similarity under *both* descriptors, since the underlying image is identical up to reproduction noise; scan-stage noise can in principle push a replicated pair off either extremum but rarely both.
Hand-signing, by contrast, often yields high dHash similarity (the overall layout of a signature is typically preserved across writing occasions) but measurably lower cosine similarity (fine execution varies).
One working hypothesis is that some hand-signed repetitions may preserve coarse layout while varying in fine execution, producing relatively higher dHash similarity than cosine similarity within a same-CPA pair; the classifier does not require this hypothesis to hold for all CPAs, and the descriptor-level pattern is used only as input to the deployed rule, not as a within-CPA consistency claim.
Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline.
We do not use SSIM (Structural Similarity Index) [30] or pixel-level comparison as primary descriptors. SSIM was developed as a perceptual quality index for natural images and is by construction sensitive to the local-luminance and local-contrast perturbations routine in a print-scan cycle (JPEG block artefacts, scan-noise speckle, scanner-rule ghosts) — properties that penalise identically-reproduced signature crops at the very margins SSIM is designed to weight most heavily. Pixel-level distances ($L_1$, $L_2$, pixel-identity counting) are defined on geometrically aligned images at a common resolution and inflate under the sub-pixel offsets that scanner DPI, paper-handling alignment, and PDF-page rasterisation routinely introduce, so two scans of the same physical document cannot score near-identically. The supplementary materials contain the full design-level argument; pixel-identity counting is retained only as a threshold-free positive anchor (§III-K), because byte-identical pairs are necessarily produced by literal file reuse and so do not interact with the alignment-fragility argument.
@@ -115,13 +115,13 @@ Cosine similarity on L2-normalised deep embeddings and dHash both remain stable
## G. Unit of Analysis and Scope
We analyse signatures at two units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-H.1) and of the signature-level analyses in §IV (notably §IV-J for the five-way per-signature category counts and the inter-CPA negative-anchor coincidence-rate analysis referenced in §IV-I). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), of per-CPA internal-consistency analysis (§III-K), and of the leave-one-firm-out reproducibility check (§III-K). At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses.
We analyse signatures at two **descriptor-summary** units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-H.1) and of the signature-level analyses in §IV (notably §IV-J for the five-way per-signature category counts and the inter-CPA negative-anchor coincidence-rate analysis referenced in §IV-I). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), of per-CPA internal-consistency analysis (§III-K), and of the leave-one-firm-out reproducibility check (§III-K). At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses. §III-L additionally characterises the deployed rule's behaviour at three **operational reporting** units (per-comparison, per-signature, per-document), which are distinct from the descriptor-summary units defined here: the descriptor-summary units summarise input descriptors; the operational reporting units summarise rule outputs.
We make no within-year or across-year uniformity assumption about CPA signing mechanisms. Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation. A CPA's per-CPA mean is a *summary statistic* of their observed signatures, not a claim that all of their signatures share a single mechanism.
We adopt one stipulation about same-CPA pair detectability:
> **(A1) Pair-detectability.** *If a CPA uses image replication anywhere in the corpus, then at least one same-CPA signature pair is near-identical (after reproduction noise) within the cross-year same-CPA pool used by the max-cosine / min-dHash computation.*
> **(A1) Pair-detectability.** *If a CPA uses image replication anywhere in the corpus, then at least one same-CPA signature pair is near-identical (after reproduction noise) within the observed same-CPA candidate pool used by the max-cosine / min-dHash computation, pooled over the CPA's reports across years. A1 does not assume temporal stability of handwriting or scanning workflow within or across years.*
A1 is plausible for high-volume stamping or firm-level electronic signing workflows but is not guaranteed when (i) the corpus contains only one observed replicated report for a CPA, (ii) multiple template variants are used in parallel, or (iii) scan-stage noise pushes a replicated pair outside the detection regime. A1 is the only assumption the per-signature detector requires to be sensitive to replication.
@@ -141,13 +141,13 @@ A1 is plausible for high-volume stamping or firm-level electronic signing workfl
### H.1. Deployed Operational Rule
Each Big-4 signature is assigned to one of five categories using the per-signature descriptor pair $(\text{cos}_s, \text{dHash}_s)$ where $\text{cos}_s$ is the maximum cosine similarity to another signature by the same CPA and $\text{dHash}_s$ is the minimum independent dHash to another signature by the same CPA:
Each Big-4 signature is assigned to one of five categories using the per-signature descriptor pair $(\text{cos}_s, \text{dHash}_s)$ where $\text{cos}_s$ is the maximum cosine similarity to another signature by the same CPA and $\text{dHash}_s$ is the minimum independent dHash to another signature by the same CPA. The five labels below name regions of the descriptor space and are operational rule outputs, not validated ground-truth classes; the label names reflect the screening hypothesis associated with each region and are subject to the unsupervised-setting caveats of §III-M:
1. **High-confidence non-hand-signed (HC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. Both descriptors converge on strong replication evidence.
2. **Moderate-confidence non-hand-signed (MC):** Cosine $> 0.95$ AND $5 < \text{dHash}_{\text{indep}} \leq 15$. Feature-level evidence is strong; structural similarity is present but below the high-confidence cutoff.
3. **High style consistency (HSC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} > 15$. High feature-level similarity without structural corroboration — consistent with a CPA who signs very consistently but not via image reproduction.
1. **High-confidence non-hand-signed (HC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. Both descriptors converge on image-similarity evidence consistent with replication; mechanism attribution remains subject to §III-M.
2. **Moderate-confidence non-hand-signed (MC):** Cosine $> 0.95$ AND $5 < \text{dHash}_{\text{indep}} \leq 15$. Feature-level similarity is strong; structural similarity is present but below the high-confidence cutoff.
3. **High style consistency (HSC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} > 15$. High feature-level similarity without structural corroboration; the descriptor signature is operationally distinguished from HC/MC, but the underlying mechanism (within-CPA signing style, lossy image reproduction with structural drift, or a hybrid) is not resolved by descriptor data alone.
4. **Uncertain (UN):** Cosine between the all-pairs intra/inter KDE crossover ($0.837$) and $0.95$.
5. **Likely hand-signed (LH):** Cosine $\leq 0.837$.
5. **Likely hand-signed (LH):** Cosine $\leq 0.837$. The "Likely hand-signed" name reflects the screening hypothesis that low maximum same-CPA cosine similarity is more consistent with hand-signing variation than with image replication; the label is operational, not a verified hand-signed classification, since cross-year handwriting drift, scanner-workflow change, or template variant rotation within a CPA's reports can also yield a low max-cosine within a same-CPA pool.
Document-level labels are aggregated via the worst-case rule: each audit report inherits the most-replication-consistent category among its certifying-CPA signatures (rank order HC > MC > HSC > UN > LH). The thresholds ($\text{cos} = 0.95$ as the cosine operating point, $\text{cos} = 0.837$ as the all-pairs KDE crossover, $\text{dHash} = 5$ and $15$ as structural-similarity sub-band cutoffs) retain their prior calibration provenance (see supplementary materials). These thresholds define the deployed screening rule; the present analysis does not re-derive them as optimal cutoffs but characterises their behaviour under inter-CPA coincidence anchors (developed in §III-L).
@@ -155,7 +155,7 @@ The remainder of this section (§III-H.2) describes the reference populations us
### H.2. Reference Populations
The calibration distinguishes two reference populations: Firm A as a within-Big-4 templated-end case study, and the 249 non-Big-4 CPAs as an out-of-target reference for internal-consistency checking.
The supporting diagnostics use two reference populations: Firm A as a within-Big-4 templated-end case study, and the 249 non-Big-4 CPAs as an out-of-target reference for internal-consistency checking. Neither population is the calibration anchor for the deployed threshold; both are descriptive references that inform the cross-checks in §III-K.
**Internal reference: Firm A as the templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 descriptive partition (§III-J; Scripts 35, 38), Firm A accounts for 0% of the C1 component (low-cos / high-dHash corner; cos $\approx 0.946$, dHash $\approx 9.17$, weight $\approx 0.143$), 17.5% of the C2 component (central region), and 82.5% of the C3 component (high-cos / low-dHash corner); the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3, hereafter referred to as "the Firm whose CPAs are most concentrated in C1"). Byte-level decomposition of these signatures (see supplementary materials) identifies 145 Firm A pixel-identical signatures, spanning 50 distinct Firm A partners of the 180 registered, with 35 byte-identical matches occurring across different fiscal years; the 145 are the Firm A portion of the 262 byte-identical Big-4 signatures.
@@ -318,7 +318,7 @@ The cosine row at $\text{cos} > 0.95$ is consistent with the corpus-wide per-com
The all-firms-scope sample yields slightly lower per-comparison coincidence rates (cos $> 0.95$: $0.00031$; dHash $\leq 5$: $0.00073$; joint: $0.00007$); the all-firms sample weights small CPAs more heavily under CPA-uniform pair sampling, so we treat the Big-4 sample as the primary calibration scope and report all-firms as a corroborating-scope robustness check.
**Conditional inter-CPA coincidence rate.** A natural follow-up question is whether the dHash dimension provides marginal specificity beyond the cosine gate. For pairs with cos $> 0.95$, the conditional rate of dHash $\leq 5$ is $0.234$ (Wilson 95% CI $[0.190, 0.285]$; $70$ of $299$ pairs in the Big-4 sample). At cos $> 0.95$, dHash provides $\sim 4.3\times$ further per-comparison specificity (joint $0.00014$ vs cos-only $0.00060$).
**Conditional inter-CPA coincidence rate.** A natural follow-up question is whether the dHash dimension provides marginal specificity beyond the cosine gate. For pairs with cos $> 0.95$, the conditional rate of dHash $\leq 5$ is $0.234$ (Wilson 95% CI $[0.190, 0.285]$; $70$ of $299$ pairs in the Big-4 sample). At cos $> 0.95$, dHash provides $\sim 4.3\times$ further per-comparison ICCR refinement (joint $0.00014$ vs cos-only $0.00060$).
The per-comparison rate is a useful *specificity-proxy calibration* for the deployed rule's pair-level behaviour. It does *not* directly translate to the deployed-rule specificity at the per-signature classifier level, because the deployed classifier takes extrema over a same-CPA pool of size $n_{\text{pool}}$. The pool-normalised inter-CPA alert rate is reported in §III-L.2.
@@ -342,7 +342,7 @@ Per-firm any-pair rates (no bootstrap; descriptive):
| Firm C | $38{,}616$ | $0.0053$ | $0.0019$ |
| Firm D | $17{,}133$ | $0.0110$ | $0.0051$ |
**Pool-size decile dependence.** The deployed rule's pool-normalised rate is monotonically (broadly) increasing in $n_{\text{pool}}$, consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form expected under inter-CPA independence (Script 43 decile table). Decile 1 (smallest pools, $n_{\text{pool}} \in [0, 201]$): any-pair ICCR $= 0.0249$. Decile 10 (largest, $n_{\text{pool}} \in [846, 1115]$): any-pair ICCR $= 0.1905$. The trend is broadly monotonic with two minor non-monotone reversals (decile 5 and decile 9 dip below their predecessors).
**Pool-size decile dependence.** The deployed rule's pool-normalised rate is monotonically (broadly) increasing in $n_{\text{pool}}$, consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form expected under inter-CPA independence (Script 43 decile table). This functional form is used as descriptive intuition for the broad monotone trend, not as an independence assumption used for estimation; the within-firm violation of inter-CPA independence (§III-L.4) bounds how literally the closed form can be read. Decile 1 (smallest pools, $n_{\text{pool}} \in [0, 201]$): any-pair ICCR $= 0.0249$. Decile 10 (largest, $n_{\text{pool}} \in [846, 1115]$): any-pair ICCR $= 0.1905$. The trend is broadly monotonic with two minor non-monotone reversals (decile 5 and decile 9 dip below their predecessors).
**Threshold sensitivity at per-signature unit.** Tightening the HC rule from $\text{dHash} \leq 5$ to $\text{dHash} \leq 3$ (same-pair) reduces the per-signature ICCR from $0.0827$ to $0.0449$ (Big-4 pooled); tightening to $\text{dHash} \leq 4$ gives $0.0639$ (same-pair). A stricter operating point of dHash $\leq 3$ same-pair would correspond to a per-signature ICCR of $\approx 0.05$; the deployed HC any-pair rule with $\text{dHash} \leq 5$ corresponds to $\approx 0.11$. Stakeholders requiring a tighter specificity proxy could consider the dHash $\leq 3$ same-pair variant, with the unsupervised-setting caveats of §III-M.
@@ -393,7 +393,7 @@ The per-decile per-firm breakdown (Script 44) confirms the pattern: within every
For the same-pair joint event (a single candidate satisfying both $\text{cos} > 0.95$ and $\text{dHash} \leq 5$), the candidate firm is even more strongly concentrated within the source firm: Firm A source $\to$ Firm A candidate in $11{,}314$ of $11{,}319$ same-pair hits ($99.96\%$); Firm B source $\to$ Firm B candidate in $85$ of $87$ ($97.7\%$); Firm C source $\to$ Firm C candidate in $54$ of $55$ ($98.2\%$); Firm D source $\to$ Firm D candidate in $64$ of $66$ ($97.0\%$).
**Interpretation.** Under the deployed any-pair rule, the within-firm collision concentration is $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D — Firm A's pattern is markedly more within-firm-concentrated than the other three firms', though every Big-4 firm still has more than three quarters of its any-pair collisions falling on candidates within the same firm. The stricter same-pair joint event — a single candidate satisfying both cos $> 0.95$ and dHash $\leq 5$ — saturates at $97.0$$99.96\%$ within-firm across all four firms. This pattern is consistent with — but not by itself diagnostic of — firm-specific template, stamp, or document-production reuse: within-firm scanning workflows, common form templates, and shared report-generation infrastructure could produce visually similar signature crops across different CPAs within the same firm. Byte-level decomposition of Firm A's $145$ pixel-identical signatures across $\sim 50$ distinct certifying partners (supplementary materials; §III-H.2) provides direct evidence of image-level reuse among Firm A signatures; the distribution across many partners is consistent with a firm-level template or production workflow, and the broader inter-CPA collision pattern in §III-L.4 is consistent with similar, milder production-related reuse patterns at Firms B/C/D. We report this as "inter-CPA collision concentration is within-firm" — a descriptive observation about deployed-rule behaviour — and refrain from inferring that the within-firm hits constitute deliberate or systematic template sharing.
**Interpretation.** Under the deployed any-pair rule, the within-firm collision concentration is $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D — Firm A's pattern is markedly more within-firm-concentrated than the other three firms', though every Big-4 firm still has more than three quarters of its any-pair collisions falling on candidates within the same firm. The stricter same-pair joint event — a single candidate satisfying both cos $> 0.95$ and dHash $\leq 5$ — saturates at $97.0$$99.96\%$ within-firm across all four firms. This pattern is consistent with — but not by itself diagnostic of — firm-specific template, stamp, or document-production reuse: within-firm scanning workflows, common form templates, and shared report-generation infrastructure could produce visually similar signature crops across different CPAs within the same firm. Byte-level decomposition of Firm A's $145$ pixel-identical signatures across $\sim 50$ distinct certifying partners (supplementary materials; §III-H.2) provides direct evidence of image-level reuse among Firm A signatures; the distribution across many partners is consistent with a firm-level template or production workflow, and the broader inter-CPA collision pattern in §III-L.4 is consistent with similar, milder within-firm collision patterns at Firms B/C/D, whose mechanisms may include template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity (§V-H). We report this as "inter-CPA collision concentration is within-firm" — a descriptive observation about deployed-rule behaviour — and refrain from inferring that the within-firm hits constitute deliberate or systematic template sharing.
This connects back to §III-J: the K=3 firm-composition contrast at the accountant level (Firm A dominating C3; Firm C dominating C1) reappears at the deployment level in the cross-firm hit matrix, where the within-firm collision concentration is the dominant pattern at all four Big-4 firms — most strongly at Firm A ($98.8\%$ any-pair, $99.96\%$ same-pair) and at materially lower but still majority levels at Firms B/C/D ($76.7$$83.7\%$ any-pair; $97.0$$98.2\%$ same-pair).
@@ -416,7 +416,7 @@ The per-signature observed-deployed rate is $\sim 4.5\times$ the pool-normalised
- Per-signature: $0.4958 - 0.1102 = 0.3856$ ($38.6$ pp excess)
- Per-document HC: $0.6228 - 0.1797 = 0.4431$ ($44.3$ pp excess)
We *do not* interpret the deployed-rate excess as a presumed true-positive rate; the inferential limits of this interpretation are developed in §III-M. The deployed-rate excess is best read as a *same-CPA repeatability signal* — a quantity that exceeds what random inter-CPA candidate replacement would produce — rather than as an estimate of true replication prevalence.
We *do not* interpret the deployed-rate excess as a presumed true-positive rate; the inferential limits of this interpretation are developed in §III-M. The deployed-rate excess is best read as an *observed same-CPA-pool excess* — a quantity that exceeds what random inter-CPA candidate replacement would produce — whose mechanism is not identifiable from descriptor-only data (§III-M); we do not attribute it to within-CPA handwriting repeatability or to image replication without further evidence.
## M. Unsupervised Diagnostic Strategy and Limits
+1 -1
View File
@@ -429,4 +429,4 @@ Same-pair joint hits (single candidate satisfying both cos $> 0.95$ AND dHash $\
| dHash $= 5$ (HC) | $\approx 3.8\times$ | locally sensitive (not plateau-stable) |
| dHash $= 15$ (MC/HSC boundary) | $\approx 0.08$ | plateau-like (saturating tail) |
Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. The deployed-rate excess over the inter-CPA proxy is $0.3856$ ($38.6$ pp) per-signature and $0.4431$ ($44.3$ pp) per-document; this excess is interpreted as a same-CPA repeatability signal under the §III-M caveats, not as a presumed true-positive rate.
Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. The deployed-rate excess over the inter-CPA proxy is $0.3856$ ($38.6$ pp) per-signature and $0.4431$ ($44.3$ pp) per-document; this excess is reported as an observed same-CPA-pool excess under §III-M caveats, not as a presumed true-positive rate and not attributed to within-CPA handwriting repeatability.
+20 -20
View File
@@ -39,13 +39,13 @@ The contributions of this paper are:
2. **End-to-end pipeline.** We present a pipeline that processes raw PDF audit reports through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor similarity computation, with automated inference and no manual intervention after initial training.
3. **Dual-descriptor similarity.** We demonstrate that combining deep-feature cosine similarity with independent-minimum dHash resolves the ambiguity between *style consistency* and *image reproduction*, and we support the backbone choice through a feature-backbone ablation.
3. **Dual-descriptor similarity.** We demonstrate that combining deep-feature cosine similarity with independent-minimum dHash provides complementary evidence for screening cases where *style consistency* and *image reproduction* hypotheses diverge, and we support the backbone choice through a feature-backbone ablation.
4. **Composition decomposition disproves the distributional-threshold path.** We show via a 2×2 factorial diagnostic (firm-mean centring × integer-tie jitter) that the apparent multimodality of the Big-4 accountant-level descriptor distribution is fully attributable to between-firm location shifts and integer mass-point artefacts. The descriptor distributions contain no within-population bimodal antimode; a distributional "natural threshold" reading of the operating points is not empirically supported.
4. **Composition decomposition does not support the distributional-threshold path.** We show via a 2×2 factorial diagnostic (firm-mean centring × integer-tie jitter) that the apparent multimodality of the Big-4 accountant-level descriptor distribution is fully attributable to between-firm location shifts and integer mass-point artefacts. The descriptor distributions contain no within-population bimodal antimode; a distributional "natural threshold" reading of the operating points is not empirically supported.
5. **Anchor-based multi-level inter-CPA coincidence-rate calibration.** We characterise the deployed five-way classifier at three units of analysis: per-comparison ICCR (cos$>0.95$: $0.0006$; dHash$\leq 5$: $0.0013$; joint: $0.00014$), pool-normalised per-signature ICCR ($0.11$ for the deployed any-pair high-confidence rule), and per-document ICCR ($0.34$ for the operational HC$+$MC alarm). We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for terminology that requires ground-truth negative labels, which the corpus does not provide.
5. **Anchor-based multi-level inter-CPA coincidence-rate calibration.** We characterise the deployed high-confidence (HC) sub-rule and document-level HC$+$MC alarm derived from the five-way classifier at three units of analysis: per-comparison ICCR (cos$>0.95$: $0.0006$; dHash$\leq 5$: $0.0013$; joint: $0.00014$), pool-normalised per-signature ICCR ($0.11$ for the deployed any-pair high-confidence rule), and per-document ICCR ($0.34$ for the operational HC$+$MC alarm). We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for terminology that requires ground-truth negative labels, which the corpus does not provide.
6. **Firm heterogeneity quantification and within-firm cross-CPA collision concentration.** Per-document D2 inter-CPA proxy ICCRs differ by an order of magnitude across firms (Firm A: $0.62$ versus Firms B/C/D: $0.09$$0.16$); a per-signature logistic regression of the any-pair HC hit indicator on firm dummies and centred log pool size confirms the firm gap persists after pool-size control. Cross-firm hit matrix analysis shows within-firm collision concentrations of $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D under the deployed any-pair rule (the stricter same-pair joint event saturates at $97.0$$99.96\%$ within-firm across all four firms); the pattern is consistent with firm-specific template, stamp, or document-production reuse mechanisms — a descriptive finding about deployed-rule behaviour, not a claim of deliberate template sharing.
6. **Firm heterogeneity quantification and within-firm cross-CPA collision concentration.** Per-document D2 inter-CPA proxy ICCRs differ by an order of magnitude across firms (Firm A: $0.62$ versus Firms B/C/D: $0.09$$0.16$); a per-signature logistic regression of the any-pair HC hit indicator on firm dummies and centred log pool size confirms the firm gap persists after pool-size control. Cross-firm hit matrix analysis shows within-firm collision concentrations of $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D under the deployed any-pair rule (the stricter same-pair joint event saturates at $97.0$$99.96\%$ within-firm across all four firms); the pattern is consistent with — but does not independently establish — firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H).
7. **K=3 as descriptive firm-compositional partition; three-score convergent internal consistency.** We fit a K=3 Gaussian mixture as a descriptive partition of the Big-4 accountant-level distribution (interpreted as firm-compositional structure, not as three mechanism clusters). Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$; we report this as internal consistency rather than external validation, given that the scores share the underlying descriptor pair.
@@ -274,7 +274,7 @@ Unlike DCT-based perceptual hashes, dHash is computationally lightweight and par
These descriptors provide partially independent evidence.
Cosine similarity is sensitive to the full feature distribution and reflects fine-grained execution variation; dHash captures only coarse perceptual structure and is robust to scanner-induced noise.
Non-hand-signing is expected to yield extreme similarity under *both* descriptors, since the underlying image is identical up to reproduction noise; scan-stage noise can in principle push a replicated pair off either extremum but rarely both.
Hand-signing, by contrast, often yields high dHash similarity (the overall layout of a signature is typically preserved across writing occasions) but measurably lower cosine similarity (fine execution varies).
One working hypothesis is that some hand-signed repetitions may preserve coarse layout while varying in fine execution, producing relatively higher dHash similarity than cosine similarity within a same-CPA pair; the classifier does not require this hypothesis to hold for all CPAs, and the descriptor-level pattern is used only as input to the deployed rule, not as a within-CPA consistency claim.
Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline.
We do not use SSIM (Structural Similarity Index) [30] or pixel-level comparison as primary descriptors. SSIM was developed as a perceptual quality index for natural images and is by construction sensitive to the local-luminance and local-contrast perturbations routine in a print-scan cycle (JPEG block artefacts, scan-noise speckle, scanner-rule ghosts) — properties that penalise identically-reproduced signature crops at the very margins SSIM is designed to weight most heavily. Pixel-level distances ($L_1$, $L_2$, pixel-identity counting) are defined on geometrically aligned images at a common resolution and inflate under the sub-pixel offsets that scanner DPI, paper-handling alignment, and PDF-page rasterisation routinely introduce, so two scans of the same physical document cannot score near-identically. The supplementary materials contain the full design-level argument; pixel-identity counting is retained only as a threshold-free positive anchor (§III-K), because byte-identical pairs are necessarily produced by literal file reuse and so do not interact with the alignment-fragility argument.
@@ -283,13 +283,13 @@ Cosine similarity on L2-normalised deep embeddings and dHash both remain stable
## G. Unit of Analysis and Scope
We analyse signatures at two units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-H.1) and of the signature-level analyses in §IV (notably §IV-J for the five-way per-signature category counts and the inter-CPA negative-anchor coincidence-rate analysis referenced in §IV-I). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), of per-CPA internal-consistency analysis (§III-K), and of the leave-one-firm-out reproducibility check (§III-K). At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses.
We analyse signatures at two **descriptor-summary** units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-H.1) and of the signature-level analyses in §IV (notably §IV-J for the five-way per-signature category counts and the inter-CPA negative-anchor coincidence-rate analysis referenced in §IV-I). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), of per-CPA internal-consistency analysis (§III-K), and of the leave-one-firm-out reproducibility check (§III-K). At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses. §III-L additionally characterises the deployed rule's behaviour at three **operational reporting** units (per-comparison, per-signature, per-document), which are distinct from the descriptor-summary units defined here: the descriptor-summary units summarise input descriptors; the operational reporting units summarise rule outputs.
We make no within-year or across-year uniformity assumption about CPA signing mechanisms. Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation. A CPA's per-CPA mean is a *summary statistic* of their observed signatures, not a claim that all of their signatures share a single mechanism.
We adopt one stipulation about same-CPA pair detectability:
> **(A1) Pair-detectability.** *If a CPA uses image replication anywhere in the corpus, then at least one same-CPA signature pair is near-identical (after reproduction noise) within the cross-year same-CPA pool used by the max-cosine / min-dHash computation.*
> **(A1) Pair-detectability.** *If a CPA uses image replication anywhere in the corpus, then at least one same-CPA signature pair is near-identical (after reproduction noise) within the observed same-CPA candidate pool used by the max-cosine / min-dHash computation, pooled over the CPA's reports across years. A1 does not assume temporal stability of handwriting or scanning workflow within or across years.*
A1 is plausible for high-volume stamping or firm-level electronic signing workflows but is not guaranteed when (i) the corpus contains only one observed replicated report for a CPA, (ii) multiple template variants are used in parallel, or (iii) scan-stage noise pushes a replicated pair outside the detection regime. A1 is the only assumption the per-signature detector requires to be sensitive to replication.
@@ -309,13 +309,13 @@ A1 is plausible for high-volume stamping or firm-level electronic signing workfl
### H.1. Deployed Operational Rule
Each Big-4 signature is assigned to one of five categories using the per-signature descriptor pair $(\text{cos}_s, \text{dHash}_s)$ where $\text{cos}_s$ is the maximum cosine similarity to another signature by the same CPA and $\text{dHash}_s$ is the minimum independent dHash to another signature by the same CPA:
Each Big-4 signature is assigned to one of five categories using the per-signature descriptor pair $(\text{cos}_s, \text{dHash}_s)$ where $\text{cos}_s$ is the maximum cosine similarity to another signature by the same CPA and $\text{dHash}_s$ is the minimum independent dHash to another signature by the same CPA. The five labels below name regions of the descriptor space and are operational rule outputs, not validated ground-truth classes; the label names reflect the screening hypothesis associated with each region and are subject to the unsupervised-setting caveats of §III-M:
1. **High-confidence non-hand-signed (HC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. Both descriptors converge on strong replication evidence.
2. **Moderate-confidence non-hand-signed (MC):** Cosine $> 0.95$ AND $5 < \text{dHash}_{\text{indep}} \leq 15$. Feature-level evidence is strong; structural similarity is present but below the high-confidence cutoff.
3. **High style consistency (HSC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} > 15$. High feature-level similarity without structural corroboration — consistent with a CPA who signs very consistently but not via image reproduction.
1. **High-confidence non-hand-signed (HC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. Both descriptors converge on image-similarity evidence consistent with replication; mechanism attribution remains subject to §III-M.
2. **Moderate-confidence non-hand-signed (MC):** Cosine $> 0.95$ AND $5 < \text{dHash}_{\text{indep}} \leq 15$. Feature-level similarity is strong; structural similarity is present but below the high-confidence cutoff.
3. **High style consistency (HSC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} > 15$. High feature-level similarity without structural corroboration; the descriptor signature is operationally distinguished from HC/MC, but the underlying mechanism (within-CPA signing style, lossy image reproduction with structural drift, or a hybrid) is not resolved by descriptor data alone.
4. **Uncertain (UN):** Cosine between the all-pairs intra/inter KDE crossover ($0.837$) and $0.95$.
5. **Likely hand-signed (LH):** Cosine $\leq 0.837$.
5. **Likely hand-signed (LH):** Cosine $\leq 0.837$. The "Likely hand-signed" name reflects the screening hypothesis that low maximum same-CPA cosine similarity is more consistent with hand-signing variation than with image replication; the label is operational, not a verified hand-signed classification, since cross-year handwriting drift, scanner-workflow change, or template variant rotation within a CPA's reports can also yield a low max-cosine within a same-CPA pool.
Document-level labels are aggregated via the worst-case rule: each audit report inherits the most-replication-consistent category among its certifying-CPA signatures (rank order HC > MC > HSC > UN > LH). The thresholds ($\text{cos} = 0.95$ as the cosine operating point, $\text{cos} = 0.837$ as the all-pairs KDE crossover, $\text{dHash} = 5$ and $15$ as structural-similarity sub-band cutoffs) retain their prior calibration provenance (see supplementary materials). These thresholds define the deployed screening rule; the present analysis does not re-derive them as optimal cutoffs but characterises their behaviour under inter-CPA coincidence anchors (developed in §III-L).
@@ -323,7 +323,7 @@ The remainder of this section (§III-H.2) describes the reference populations us
### H.2. Reference Populations
The calibration distinguishes two reference populations: Firm A as a within-Big-4 templated-end case study, and the 249 non-Big-4 CPAs as an out-of-target reference for internal-consistency checking.
The supporting diagnostics use two reference populations: Firm A as a within-Big-4 templated-end case study, and the 249 non-Big-4 CPAs as an out-of-target reference for internal-consistency checking. Neither population is the calibration anchor for the deployed threshold; both are descriptive references that inform the cross-checks in §III-K.
**Internal reference: Firm A as the templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 descriptive partition (§III-J; Scripts 35, 38), Firm A accounts for 0% of the C1 component (low-cos / high-dHash corner; cos $\approx 0.946$, dHash $\approx 9.17$, weight $\approx 0.143$), 17.5% of the C2 component (central region), and 82.5% of the C3 component (high-cos / low-dHash corner); the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3, hereafter referred to as "the Firm whose CPAs are most concentrated in C1"). Byte-level decomposition of these signatures (see supplementary materials) identifies 145 Firm A pixel-identical signatures, spanning 50 distinct Firm A partners of the 180 registered, with 35 byte-identical matches occurring across different fiscal years; the 145 are the Firm A portion of the 262 byte-identical Big-4 signatures.
@@ -486,7 +486,7 @@ The cosine row at $\text{cos} > 0.95$ is consistent with the corpus-wide per-com
The all-firms-scope sample yields slightly lower per-comparison coincidence rates (cos $> 0.95$: $0.00031$; dHash $\leq 5$: $0.00073$; joint: $0.00007$); the all-firms sample weights small CPAs more heavily under CPA-uniform pair sampling, so we treat the Big-4 sample as the primary calibration scope and report all-firms as a corroborating-scope robustness check.
**Conditional inter-CPA coincidence rate.** A natural follow-up question is whether the dHash dimension provides marginal specificity beyond the cosine gate. For pairs with cos $> 0.95$, the conditional rate of dHash $\leq 5$ is $0.234$ (Wilson 95% CI $[0.190, 0.285]$; $70$ of $299$ pairs in the Big-4 sample). At cos $> 0.95$, dHash provides $\sim 4.3\times$ further per-comparison specificity (joint $0.00014$ vs cos-only $0.00060$).
**Conditional inter-CPA coincidence rate.** A natural follow-up question is whether the dHash dimension provides marginal specificity beyond the cosine gate. For pairs with cos $> 0.95$, the conditional rate of dHash $\leq 5$ is $0.234$ (Wilson 95% CI $[0.190, 0.285]$; $70$ of $299$ pairs in the Big-4 sample). At cos $> 0.95$, dHash provides $\sim 4.3\times$ further per-comparison ICCR refinement (joint $0.00014$ vs cos-only $0.00060$).
The per-comparison rate is a useful *specificity-proxy calibration* for the deployed rule's pair-level behaviour. It does *not* directly translate to the deployed-rule specificity at the per-signature classifier level, because the deployed classifier takes extrema over a same-CPA pool of size $n_{\text{pool}}$. The pool-normalised inter-CPA alert rate is reported in §III-L.2.
@@ -510,7 +510,7 @@ Per-firm any-pair rates (no bootstrap; descriptive):
| Firm C | $38{,}616$ | $0.0053$ | $0.0019$ |
| Firm D | $17{,}133$ | $0.0110$ | $0.0051$ |
**Pool-size decile dependence.** The deployed rule's pool-normalised rate is monotonically (broadly) increasing in $n_{\text{pool}}$, consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form expected under inter-CPA independence (Script 43 decile table). Decile 1 (smallest pools, $n_{\text{pool}} \in [0, 201]$): any-pair ICCR $= 0.0249$. Decile 10 (largest, $n_{\text{pool}} \in [846, 1115]$): any-pair ICCR $= 0.1905$. The trend is broadly monotonic with two minor non-monotone reversals (decile 5 and decile 9 dip below their predecessors).
**Pool-size decile dependence.** The deployed rule's pool-normalised rate is monotonically (broadly) increasing in $n_{\text{pool}}$, consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form expected under inter-CPA independence (Script 43 decile table). This functional form is used as descriptive intuition for the broad monotone trend, not as an independence assumption used for estimation; the within-firm violation of inter-CPA independence (§III-L.4) bounds how literally the closed form can be read. Decile 1 (smallest pools, $n_{\text{pool}} \in [0, 201]$): any-pair ICCR $= 0.0249$. Decile 10 (largest, $n_{\text{pool}} \in [846, 1115]$): any-pair ICCR $= 0.1905$. The trend is broadly monotonic with two minor non-monotone reversals (decile 5 and decile 9 dip below their predecessors).
**Threshold sensitivity at per-signature unit.** Tightening the HC rule from $\text{dHash} \leq 5$ to $\text{dHash} \leq 3$ (same-pair) reduces the per-signature ICCR from $0.0827$ to $0.0449$ (Big-4 pooled); tightening to $\text{dHash} \leq 4$ gives $0.0639$ (same-pair). A stricter operating point of dHash $\leq 3$ same-pair would correspond to a per-signature ICCR of $\approx 0.05$; the deployed HC any-pair rule with $\text{dHash} \leq 5$ corresponds to $\approx 0.11$. Stakeholders requiring a tighter specificity proxy could consider the dHash $\leq 3$ same-pair variant, with the unsupervised-setting caveats of §III-M.
@@ -561,7 +561,7 @@ The per-decile per-firm breakdown (Script 44) confirms the pattern: within every
For the same-pair joint event (a single candidate satisfying both $\text{cos} > 0.95$ and $\text{dHash} \leq 5$), the candidate firm is even more strongly concentrated within the source firm: Firm A source $\to$ Firm A candidate in $11{,}314$ of $11{,}319$ same-pair hits ($99.96\%$); Firm B source $\to$ Firm B candidate in $85$ of $87$ ($97.7\%$); Firm C source $\to$ Firm C candidate in $54$ of $55$ ($98.2\%$); Firm D source $\to$ Firm D candidate in $64$ of $66$ ($97.0\%$).
**Interpretation.** Under the deployed any-pair rule, the within-firm collision concentration is $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D — Firm A's pattern is markedly more within-firm-concentrated than the other three firms', though every Big-4 firm still has more than three quarters of its any-pair collisions falling on candidates within the same firm. The stricter same-pair joint event — a single candidate satisfying both cos $> 0.95$ and dHash $\leq 5$ — saturates at $97.0$$99.96\%$ within-firm across all four firms. This pattern is consistent with — but not by itself diagnostic of — firm-specific template, stamp, or document-production reuse: within-firm scanning workflows, common form templates, and shared report-generation infrastructure could produce visually similar signature crops across different CPAs within the same firm. Byte-level decomposition of Firm A's $145$ pixel-identical signatures across $\sim 50$ distinct certifying partners (supplementary materials; §III-H.2) provides direct evidence of image-level reuse among Firm A signatures; the distribution across many partners is consistent with a firm-level template or production workflow, and the broader inter-CPA collision pattern in §III-L.4 is consistent with similar, milder production-related reuse patterns at Firms B/C/D. We report this as "inter-CPA collision concentration is within-firm" — a descriptive observation about deployed-rule behaviour — and refrain from inferring that the within-firm hits constitute deliberate or systematic template sharing.
**Interpretation.** Under the deployed any-pair rule, the within-firm collision concentration is $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D — Firm A's pattern is markedly more within-firm-concentrated than the other three firms', though every Big-4 firm still has more than three quarters of its any-pair collisions falling on candidates within the same firm. The stricter same-pair joint event — a single candidate satisfying both cos $> 0.95$ and dHash $\leq 5$ — saturates at $97.0$$99.96\%$ within-firm across all four firms. This pattern is consistent with — but not by itself diagnostic of — firm-specific template, stamp, or document-production reuse: within-firm scanning workflows, common form templates, and shared report-generation infrastructure could produce visually similar signature crops across different CPAs within the same firm. Byte-level decomposition of Firm A's $145$ pixel-identical signatures across $\sim 50$ distinct certifying partners (supplementary materials; §III-H.2) provides direct evidence of image-level reuse among Firm A signatures; the distribution across many partners is consistent with a firm-level template or production workflow, and the broader inter-CPA collision pattern in §III-L.4 is consistent with similar, milder within-firm collision patterns at Firms B/C/D, whose mechanisms may include template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity (§V-H). We report this as "inter-CPA collision concentration is within-firm" — a descriptive observation about deployed-rule behaviour — and refrain from inferring that the within-firm hits constitute deliberate or systematic template sharing.
This connects back to §III-J: the K=3 firm-composition contrast at the accountant level (Firm A dominating C3; Firm C dominating C1) reappears at the deployment level in the cross-firm hit matrix, where the within-firm collision concentration is the dominant pattern at all four Big-4 firms — most strongly at Firm A ($98.8\%$ any-pair, $99.96\%$ same-pair) and at materially lower but still majority levels at Firms B/C/D ($76.7$$83.7\%$ any-pair; $97.0$$98.2\%$ same-pair).
@@ -584,7 +584,7 @@ The per-signature observed-deployed rate is $\sim 4.5\times$ the pool-normalised
- Per-signature: $0.4958 - 0.1102 = 0.3856$ ($38.6$ pp excess)
- Per-document HC: $0.6228 - 0.1797 = 0.4431$ ($44.3$ pp excess)
We *do not* interpret the deployed-rate excess as a presumed true-positive rate; the inferential limits of this interpretation are developed in §III-M. The deployed-rate excess is best read as a *same-CPA repeatability signal* — a quantity that exceeds what random inter-CPA candidate replacement would produce — rather than as an estimate of true replication prevalence.
We *do not* interpret the deployed-rate excess as a presumed true-positive rate; the inferential limits of this interpretation are developed in §III-M. The deployed-rate excess is best read as an *observed same-CPA-pool excess* — a quantity that exceeds what random inter-CPA candidate replacement would produce — whose mechanism is not identifiable from descriptor-only data (§III-M); we do not attribute it to within-CPA handwriting repeatability or to image replication without further evidence.
## M. Unsupervised Diagnostic Strategy and Limits
@@ -1040,7 +1040,7 @@ Same-pair joint hits (single candidate satisfying both cos $> 0.95$ AND dHash $\
| dHash $= 5$ (HC) | $\approx 3.8\times$ | locally sensitive (not plateau-stable) |
| dHash $= 15$ (MC/HSC boundary) | $\approx 0.08$ | plateau-like (saturating tail) |
Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. The deployed-rate excess over the inter-CPA proxy is $0.3856$ ($38.6$ pp) per-signature and $0.4431$ ($44.3$ pp) per-document; this excess is interpreted as a same-CPA repeatability signal under the §III-M caveats, not as a presumed true-positive rate.
Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. The deployed-rate excess over the inter-CPA proxy is $0.3856$ ($38.6$ pp) per-signature and $0.4431$ ($44.3$ pp) per-document; this excess is reported as an observed same-CPA-pool excess under §III-M caveats, not as a presumed true-positive rate and not attributed to within-CPA handwriting repeatability.
# V. Discussion
@@ -1057,7 +1057,7 @@ The Big-4 accountant-level descriptor distribution rejects unimodality on both m
Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 hard-posterior assignment (now interpreted as a firm-compositional position assignment; §III-J), Firm A accounts for $0\%$ of C1 (low-cos / high-dHash position) and $82.5\%$ of C3 (high-cos / low-dHash position); the opposite pattern holds at Firm C, which has the highest C1 concentration at $23.5\%$. Firm A also accounts for 145 of the 262 byte-identical signatures in the Big-4 byte-identical anchor of §IV-H (with Firm B 8, Firm C 107, Firm D 2). Byte-level decomposition of the 145 Firm A pixel-identical signatures (see supplementary materials) shows they span 50 distinct Firm A partners (of 180 registered), with 35 byte-identical matches occurring across different fiscal years.
We treat Firm A as a *templated-end case study within the Big-4 sub-corpus* rather than as the calibration anchor for the operational threshold. Firm A enters the Big-4 anchor-based ICCR calibration on equal footing with the other three Big-4 firms (§III-L). The cross-firm hit matrix of §III-L.4 strengthens this framing: under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$$99.96\%$ within-firm across all four firms). Firm A's per-document D2 inter-CPA proxy ICCR of $0.6201$ (versus Firms B/C/D's $0.09$$0.16$) — the counterfactual rate at which Firm A documents would fire HC$+$MC if same-CPA pools were replaced by random inter-CPA candidates — reflects high inter-CPA collision concentration under the deployed rule, consistent with firm-specific template, stamp, or document-production reuse. (The corresponding observed rate on real same-CPA pools, from Table XVI, is substantially higher: $97.5\%$ HC$+$MC for Firm A; the proxy and observed rates measure different quantities and are not directly comparable.) The inter-CPA-anchor analysis alone is not diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence of image-level reuse among Firm A signatures; the distribution across many partners is consistent with a firm-level template or production workflow, and the within-firm collision pattern at all four Big-4 firms is consistent with similar, milder production-related reuse patterns at Firms B/C/D.
We treat Firm A as a *templated-end case study within the Big-4 sub-corpus* rather than as the calibration anchor for the operational threshold. Firm A enters the Big-4 anchor-based ICCR calibration on equal footing with the other three Big-4 firms (§III-L). The cross-firm hit matrix of §III-L.4 strengthens this framing: under the deployed any-pair rule, within-firm collision concentration is $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D (the stricter same-pair joint event saturates at $97.0$$99.96\%$ within-firm across all four firms). Firm A's per-document D2 inter-CPA proxy ICCR of $0.6201$ (versus Firms B/C/D's $0.09$$0.16$) — the counterfactual rate at which Firm A documents would fire HC$+$MC if same-CPA pools were replaced by random inter-CPA candidates — reflects high inter-CPA collision concentration under the deployed rule, consistent with firm-specific template, stamp, or document-production reuse. (The corresponding observed rate on real same-CPA pools, from Table XVI, is substantially higher: $97.5\%$ HC$+$MC for Firm A; the proxy and observed rates measure different quantities and are not directly comparable.) The inter-CPA-anchor analysis alone is not diagnostic of deliberate template sharing. The byte-level evidence above (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence of image-level reuse among Firm A signatures; the distribution across many partners is consistent with a firm-level template or production workflow, and the within-firm collision pattern at all four Big-4 firms is consistent with similar, milder within-firm collision patterns at Firms B/C/D, whose mechanisms may include template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity (§V-H).
## D. K=2 / K=3 as Descriptive Firm-Compositional Partitions
@@ -1071,7 +1071,7 @@ Three feature-derived scores agree on the per-CPA descriptor-position ranking at
## F. Anchor-Based Multi-Level Calibration
The operational specificity-proxy behaviour of the deployed five-way classifier is characterised at three units of analysis (§III-L), all against the same inter-CPA negative-anchor coincidence-rate proxy. The per-comparison ICCR is consistent with the corpus-wide rate reported in §IV-I (cos$>0.95 \to 0.00060$) and extends it to the structural dimension (dHash$\leq 5 \to 0.00129$; joint $\to 0.00014$). The pool-normalised per-signature ICCR captures the deployed rule's effective per-signature rate under inter-CPA candidate-pool replacement ($0.1102$ pooled Big-4 any-pair HC), exposing that the per-comparison rate is not the deployed-rule rate at the per-signature classifier level: the deployed classifier takes max-cosine and min-dHash over a same-CPA pool of size $n_{\text{pool}}$, so the inter-CPA-equivalent rate scales approximately as $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit. The per-document ICCR aggregates to operational alarm-rate units: HC alone $0.18$, the operational HC$+$MC alarm $0.34$.
The operational specificity-proxy behaviour of the deployed HC sub-rule and document-level HC$+$MC alarm derived from the five-way classifier is characterised at three units of analysis (§III-L), all against the same inter-CPA negative-anchor coincidence-rate proxy. The per-comparison ICCR is consistent with the corpus-wide rate reported in §IV-I (cos$>0.95 \to 0.00060$) and extends it to the structural dimension (dHash$\leq 5 \to 0.00129$; joint $\to 0.00014$). The pool-normalised per-signature ICCR captures the deployed rule's effective per-signature rate under inter-CPA candidate-pool replacement ($0.1102$ pooled Big-4 any-pair HC), exposing that the per-comparison rate is not the deployed-rule rate at the per-signature classifier level: the deployed classifier takes max-cosine and min-dHash over a same-CPA pool of size $n_{\text{pool}}$, so the inter-CPA-equivalent rate scales approximately as $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit. The per-document ICCR aggregates to operational alarm-rate units: HC alone $0.18$, the operational HC$+$MC alarm $0.34$.
Two additional findings refine the calibration story. First, the per-pair conditional ICCR for dHash$\leq 5$ given cos$>0.95$ is $0.234$ (Wilson 95% $[0.190, 0.285]$): given the cosine gate, the structural dimension provides further per-comparison specificity at $\sim 4.3\times$ refinement. Second, the alert-rate sensitivity analysis (§III-L.5) shows the deployed HC threshold is locally sensitive rather than plateau-stable (local gradient $\approx 25\times$ the median for cosine, $\approx 3.8\times$ for dHash); alternative operating points can be characterised by inverting the ICCR curves (e.g., a tighter rule cos$>0.95$ AND dHash$\leq 3$ on the same-pair joint corresponds to per-signature ICCR $\approx 0.045$). The MC/HSC sub-band boundary at dHash$=15$, by contrast, *is* plateau-like (local-to-median ratio $\approx 0.08$), consistent with high-dHash-tail saturation.