Phase 6 round-3 codex-review fixes: blockers + majors + minors

Resolved Codex review (gpt-5.5 xhigh) findings against b6913d2.

BLOCKERS:
- Appendix B reference mismatch: rewrote all main-text "Appendix B" references
  to "supplementary materials" since Appendix B is now a redirect stub. Affected
  the SSIM design-argument pointer, threshold provenance, byte-level
  decomposition, MC band capture-rate, and backbone-ablation table references
  across §III-F / §III-H.1 / §III-H.2 / §III-K / §III-L.4 / §III-M / §IV-F /
  §IV-J / §IV-K / §IV-L / §V-C / §V-H.
- Table rendering: un-commented Tables I-IV (Dataset Summary, YOLO Detection,
  Extraction Results, Cosine Distribution Statistics) which were inside HTML
  comment blocks and would not have rendered in the submission.
- Table numbering out of order: Table XIX appeared before Tables XVI-XVIII.
  Renumbered XIX -> XVI (document-level worst-case counts), XVI -> XVII (Firm x
  K=3 cross-tab), XVII -> XVIII (K=3 component comparison), XVIII -> XIX
  (Spearman correlation). Cross-references updated in §IV-J / §IV-K and §V-C.
- Table V mis-citation: §IV-C said "KDE crossover ... (Table V)" but Table V is
  the dip test. Dropped the (Table V) tag; crossover is a textual finding.
- Submission cleanup: wrapped the archived Impact Statement section heading and
  body inside the existing HTML comment (it was previously rendering). The
  funding placeholder is wrapped in an HTML comment with a TO-DO note (it won't
  render but is preserved as a reminder).

MAJORS:
- Line 1077 numerical conflation: rewrote the §V-C / §III-L.4 paragraph that
  labelled Firm A's per-document HC+MC inter-CPA proxy ICCR of 0.6201 as a rate
  "on real same-CPA pools." 0.6201 is a counterfactual proxy under inter-CPA
  candidate-pool replacement, not the observed rate. Added an explicit
  disambiguation: the corresponding observed rate from Table XVI (formerly
  XIX) is 97.5% HC+MC for Firm A; the proxy and observed rates measure
  different quantities (see the display after this list).
- Residual "validation" language softened: "Dual-descriptor verification" ->
  "Dual-descriptor similarity"; "we validate the backbone choice" -> "we
  support the backbone choice"; "pixel-identity validation" -> "pixel-identity
  positive-anchor check"; "## M. Validation Strategy and Limitations under
  Unsupervised Setting" -> "## M. Unsupervised Diagnostic Strategy and Limits".
- "Specificity behaviour" overclaim: "characterises the cosine threshold's
  specificity behaviour" -> "specificity-proxy behaviour" (methodology §III-L.0
  and discussion §V-F).
- "Prior published / prior calibration" ambiguity: replaced "prior published
  per-comparison rate" with "the corpus-wide rate reported in §IV-I"; replaced
  "(prior published operating point)" with "(alternative operating point from
  supplementary calibration evidence)" in Table XXI; replaced "prior reporting
  and the existing literature" with "the existing literature and the
  supplementary calibration evidence."
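
The disambiguation in the first item can be made concrete with notation
introduced here for illustration (it is not the manuscript's): the observed
rate conditions on the real same-CPA candidate pools, the proxy ICCR on
counterfactual inter-CPA replacement pools.

```latex
% Illustrative notation only; values from the commit message above.
\[
\mathrm{ICCR}^{A}_{\mathrm{obs}}
  = \Pr\bigl[\text{HC+MC alarm} \mid \text{observed same-CPA pools}\bigr] = 0.975,
\]
\[
\mathrm{ICCR}^{A}_{\mathrm{proxy}}
  = \Pr\bigl[\text{HC+MC alarm} \mid \text{inter-CPA replacement pools}\bigr] = 0.6201.
\]
```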

MINORS:
- Line 116 Bayes-optimal qualifier: "the local density minimum ... is the
  Bayes-optimal decision boundary under equal priors" -> "In idealized
  two-class mixture settings with equal priors and equal misclassification
  costs, the local density minimum ... coincides with the Bayes-optimal
  decision boundary."
- Stale section refs: §V-G for the fine-tuning caveat retargeted to §V-H
  Engineering-level caveats (where it lives after the §V-H reorganisation);
  §III-L for the worst-case rule retargeted to §III-H.1; "Section IV-D.2"
  (nonexistent) retargeted to "Section IV-D Table VI."
- Abstract / Introduction "after pool-size adjustment": separated the
  document-level D2 proxy ICCR claim from the per-signature logistic regression
  claim. Now: "Per-document D2 inter-CPA proxy ICCRs differ by an order of
  magnitude across firms ... a per-signature logistic regression confirms the
  firm gap persists after pool-size control." (A sketch of such a control
  appears after this list.)
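
A minimal sketch of the per-signature pool-size control referenced above, with
hypothetical column names (`flag`, `firm`, `pool_size`) standing in for the
manuscript's actual variables:

```python
import numpy as np  # np.log is referenced inside the patsy formula below
import pandas as pd
import statsmodels.formula.api as smf

# One row per signature: binary HC/MC alarm flag, the CPA's firm, and the
# size of the same-CPA candidate pool (all names hypothetical).
df = pd.read_csv("per_signature_flags.csv")

# If the firm coefficients remain significant with log pool size in the
# model, the firm gap is not explained by pool-size differences alone.
result = smf.logit("flag ~ C(firm) + np.log(pool_size)", data=df).fit()
print(result.summary())
```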

NIT:
- Related Work HTML comment "(see paper_a_references_v3.md for full list)"
  -> "(full list in the References section)"; removes the version-coded
  filename reference from the source.

Artefacts:
- Combined manuscript regenerated: paper_a_v4_combined.md, 1312 lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit 9e68f2e1d3
parent b6913d2f93
Date: 2026-05-15 18:28:14 +08:00
10 changed files with 108 additions and 112 deletions
@@ -29,7 +29,8 @@ Each report is a multi-page PDF document containing, among other content, the au
CPA names, affiliated accounting firms, and audit engagement tenure were obtained from a publicly available audit-firm tenure registry encompassing 758 unique CPAs across 15 document types, with the majority (86.4%) being standard audit reports.
Table I summarizes the dataset composition.
<!-- TABLE I: Dataset Summary
**Table I.** Dataset Summary.
| Attribute | Value |
|-----------|-------|
| Total PDF documents | 90,282 |
@@ -38,7 +39,6 @@ Table I summarizes the dataset composition.
| Processed for signature extraction | 86,071 (95.3%) |
| Unique CPAs identified | 758 |
| Accounting firms | >50 |
-->
## C. Signature Page Identification
@@ -62,14 +62,14 @@ A region was labeled as "signature" if it contained any Chinese handwritten cont
The model was trained for 100 epochs on a 425/75 training/validation split with COCO pre-trained initialization, achieving strong detection performance (Table II).
<!-- TABLE II: YOLO Detection Performance
**Table II.** YOLO Detection Performance.
| Metric | Value |
|--------|-------|
| Precision | 0.97–0.98 |
| Recall | 0.95–0.98 |
| mAP@0.50 | 0.98–0.99 |
| mAP@0.50:0.95 | 0.85–0.90 |
-->
Batch inference on all 86,071 documents extracted 182,328 signature images at a rate of 43.1 documents per second (8 workers).
A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content.
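
The red-stamp removal step just described can be sketched as follows; the HSV
bounds are illustrative assumptions, since the pipeline's actual thresholds are
not part of this diff:

```python
import cv2
import numpy as np

def remove_red_stamp(bgr: np.ndarray) -> np.ndarray:
    """Replace detected red-stamp pixels with white (illustrative HSV bounds)."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    # Red straddles the hue wrap-around, so combine two hue bands.
    low = cv2.inRange(hsv, (0, 70, 50), (10, 255, 255))
    high = cv2.inRange(hsv, (170, 70, 50), (180, 255, 255))
    mask = cv2.bitwise_or(low, high)
    out = bgr.copy()
    out[mask > 0] = (255, 255, 255)  # white out the stamp, keep the ink
    return out
```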
@@ -84,7 +84,7 @@ The final classification layer was removed, yielding the 2048-dimensional output
Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization.
All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product.
The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-G).
The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-H, Engineering-level caveats).
This design choice is supported by an ablation study (Section IV-L) comparing ResNet-50 against VGG-16 and EfficientNet-B0.
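
A minimal sketch of the feature-extraction step described above (the weights
enum and normalization constants follow standard torchvision usage; the
aspect-preserving white padding is assumed done upstream):

```python
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()  # drop the classifier head: 2048-d output
backbone.eval()

@torch.no_grad()
def embed(img):
    """img: RGB PIL image, already white-padded to a square."""
    x = TF.to_tensor(TF.resize(img, [224, 224]))
    x = TF.normalize(x, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    v = backbone(x.unsqueeze(0)).squeeze(0)
    return v / v.norm()  # L2-normalised: cosine similarity equals dot product
```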
## F. Dual-Method Similarity Descriptors
@@ -109,7 +109,7 @@ Non-hand-signing is expected to yield extreme similarity under *both* descriptor
Hand-signing, by contrast, often yields high dHash similarity (the overall layout of a signature is typically preserved across writing occasions) but measurably lower cosine similarity (fine execution varies).
Convergence of the two descriptors is therefore a natural robustness check; when they disagree, the case is flagged as borderline.
We do not use SSIM (Structural Similarity Index) [30] or pixel-level comparison as primary descriptors. SSIM was developed as a perceptual quality index for natural images and is by construction sensitive to the local-luminance and local-contrast perturbations routine in a print-scan cycle (JPEG block artefacts, scan-noise speckle, scanner-rule ghosts) — properties that penalise identically-reproduced signature crops at the very margins SSIM is designed to weight most heavily. Pixel-level distances ($L_1$, $L_2$, pixel-identity counting) are defined on geometrically aligned images at a common resolution and inflate under the sub-pixel offsets that scanner DPI, paper-handling alignment, and PDF-page rasterisation routinely introduce, so two scans of the same physical document cannot score near-identically. Appendix B contains the full design-level argument; pixel-identity counting is retained only as a threshold-free positive anchor (§III-K), because byte-identical pairs are necessarily produced by literal file reuse and so do not interact with the alignment-fragility argument.
We do not use SSIM (Structural Similarity Index) [30] or pixel-level comparison as primary descriptors. SSIM was developed as a perceptual quality index for natural images and is by construction sensitive to the local-luminance and local-contrast perturbations routine in a print-scan cycle (JPEG block artefacts, scan-noise speckle, scanner-rule ghosts) — properties that penalise identically-reproduced signature crops at the very margins SSIM is designed to weight most heavily. Pixel-level distances ($L_1$, $L_2$, pixel-identity counting) are defined on geometrically aligned images at a common resolution and inflate under the sub-pixel offsets that scanner DPI, paper-handling alignment, and PDF-page rasterisation routinely introduce, so two scans of the same physical document cannot score near-identically. The supplementary materials contain the full design-level argument; pixel-identity counting is retained only as a threshold-free positive anchor (§III-K), because byte-identical pairs are necessarily produced by literal file reuse and so do not interact with the alignment-fragility argument.
Cosine similarity on L2-normalised deep embeddings and dHash both remain stable across the print-scan-rasterise cycle by design [14], [19], [21], [27]; together they constitute the dual descriptor used throughout the rest of this paper.
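
The paper's exact dHash parameters are not shown in this diff; under the
standard 64-bit difference-hash construction, the thresholds dHash = 5 and 15
read as Hamming distances between fingerprints like these:

```python
import numpy as np
from PIL import Image

def dhash_bits(img: Image.Image, size: int = 8) -> np.ndarray:
    """Difference hash: brighter-than-left-neighbour bits on a small grid."""
    g = np.asarray(img.convert("L").resize((size + 1, size)), dtype=np.int16)
    return (g[:, 1:] > g[:, :-1]).flatten()  # 64 bits when size == 8

def dhash_distance(a: Image.Image, b: Image.Image) -> int:
    """Hamming distance between fingerprints; 0 means identical coarse layout."""
    return int(np.count_nonzero(dhash_bits(a) != dhash_bits(b)))
```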
@@ -149,7 +149,7 @@ Each Big-4 signature is assigned to one of five categories using the per-signatu
4. **Uncertain (UN):** Cosine between the all-pairs intra/inter KDE crossover ($0.837$) and $0.95$.
5. **Likely hand-signed (LH):** Cosine $\leq 0.837$.
Document-level labels are aggregated via the worst-case rule: each audit report inherits the most-replication-consistent category among its certifying-CPA signatures (rank order HC > MC > HSC > UN > LH). The thresholds ($\text{cos} = 0.95$ as the cosine operating point, $\text{cos} = 0.837$ as the all-pairs KDE crossover, $\text{dHash} = 5$ and $15$ as structural-similarity sub-band cutoffs) retain their prior calibration provenance (Appendix B). These thresholds define the deployed screening rule; the present analysis does not re-derive them as optimal cutoffs but characterises their behaviour under inter-CPA coincidence anchors (developed in §III-L).
Document-level labels are aggregated via the worst-case rule: each audit report inherits the most-replication-consistent category among its certifying-CPA signatures (rank order HC > MC > HSC > UN > LH). The thresholds ($\text{cos} = 0.95$ as the cosine operating point, $\text{cos} = 0.837$ as the all-pairs KDE crossover, $\text{dHash} = 5$ and $15$ as structural-similarity sub-band cutoffs) retain their prior calibration provenance (see supplementary materials). These thresholds define the deployed screening rule; the present analysis does not re-derive them as optimal cutoffs but characterises their behaviour under inter-CPA coincidence anchors (developed in §III-L).
The remainder of this section (§III-H.2) describes the reference populations used to calibrate and cross-check this rule. §III-I demonstrates that the descriptor distributions do not provide a within-population natural threshold; §III-J–§III-K develop the descriptive partition and internal-consistency cross-checks; §III-L develops the anchor-based threshold calibration; §III-M discloses the unsupervised-setting limits.
@@ -157,7 +157,7 @@ The remainder of this section (§III-H.2) describes the reference populations us
The calibration distinguishes two reference populations: Firm A as a within-Big-4 templated-end case study, and the 249 non-Big-4 CPAs as an out-of-target reference for internal-consistency checking.
**Internal reference: Firm A as the templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 descriptive partition (§III-J; Scripts 35, 38), Firm A accounts for 0% of the C1 component (low-cos / high-dHash corner; cos $\approx 0.946$, dHash $\approx 9.17$, weight $\approx 0.143$), 17.5% of the C2 component (central region), and 82.5% of the C3 component (high-cos / low-dHash corner); the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3, hereafter referred to as "the Firm whose CPAs are most concentrated in C1"). Byte-level decomposition of these signatures (Appendix B) identifies 145 Firm A pixel-identical signatures, spanning 50 distinct Firm A partners of the 180 registered, with 35 byte-identical matches occurring across different fiscal years; the 145 are the Firm A portion of the 262 byte-identical Big-4 signatures.
**Internal reference: Firm A as the templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 descriptive partition (§III-J; Scripts 35, 38), Firm A accounts for 0% of the C1 component (low-cos / high-dHash corner; cos $\approx 0.946$, dHash $\approx 9.17$, weight $\approx 0.143$), 17.5% of the C2 component (central region), and 82.5% of the C3 component (high-cos / low-dHash corner); the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3, hereafter referred to as "the Firm whose CPAs are most concentrated in C1"). Byte-level decomposition of these signatures (see supplementary materials) identifies 145 Firm A pixel-identical signatures, spanning 50 distinct Firm A partners of the 180 registered, with 35 byte-identical matches occurring across different fiscal years; the 145 are the Firm A portion of the 262 byte-identical Big-4 signatures.
Firm A is *not* the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with Firms B through D; the K=3 components are derived from the joint Big-4 distribution (§III-J), not from Firm A alone. Firm A's role in the methodology is descriptive: it is the Big-4 firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane, and the byte-level pair evidence above provides the firm-level signature-reuse evidence that anchors §III-K's pixel-identity positive-anchor miss rate.
@@ -259,7 +259,7 @@ We read this as the strongest internal-consistency signal in the analysis: three
| Deployed binary high-confidence box rule vs per-signature K=3 hard label | $0.559$ |
| Per-CPA K=3 vs per-signature K=3 | $0.870$ |
The $\kappa = 0.870$ between per-CPA-fit and per-signature-fit K=3 binary labels indicates that per-CPA aggregation does not collapse the broad three-component ordering. The lower $\kappa = 0.56\text{–}0.66$ between the binary box rule and either K=3 fit is consistent with two factors: different decision geometries (rectangular box vs Gaussian-mixture posterior boundary), and the fact that the binary box rule is a strict subset of the five-way rule. This comparison checks only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); §III-K does not directly check the five-way rule's `5 < \text{dHash} \leq 15` moderate-confidence band, whose calibration and capture-rate evidence is reported in Appendix B and not regenerated on the Big-4 subset.
The $\kappa = 0.870$ between per-CPA-fit and per-signature-fit K=3 binary labels indicates that per-CPA aggregation does not collapse the broad three-component ordering. The lower $\kappa = 0.56\text{–}0.66$ between the binary box rule and either K=3 fit is consistent with two factors: different decision geometries (rectangular box vs Gaussian-mixture posterior boundary), and the fact that the binary box rule is a strict subset of the five-way rule. This comparison checks only the binary high-confidence rule (cos $> 0.95$ AND dHash $\leq 5$); §III-K does not directly check the five-way rule's `5 < \text{dHash} \leq 15` moderate-confidence band, whose calibration and capture-rate evidence is reported in the supplementary materials and not regenerated on the Big-4 subset.
**3. Leave-one-firm-out reproducibility (Scripts 36, 37).** Discussed in §III-J above. We summarise the joint result for cross-reference:
@@ -285,7 +285,7 @@ The operational classifier defined in §III-H.1 is calibrated by characterising
### L.0. Calibration methodology
**Calibration role of the present analysis.** The deployed thresholds of §III-H.1 preserve continuity with prior reporting and with the existing literature. §III-I.4 establishes that a recalibration cannot be anchored on distributional antimodes (no within-population bimodality exists); §III-L.1 below characterises the cosine threshold's specificity behaviour at the inter-CPA pair level and the structural-dimension threshold $\text{dHash} \leq 5$'s pair-level coincidence behaviour. The sub-band thresholds ($\text{dHash} = 15$, $\text{cos} = 0.837$) retain the prior calibration; the present calibration does not provide independent rates for those sub-bands.
**Calibration role of the present analysis.** The deployed thresholds of §III-H.1 preserve continuity with the existing literature and the supplementary calibration evidence. §III-I.4 establishes that a recalibration cannot be anchored on distributional antimodes (no within-population bimodality exists); §III-L.1 below characterises the cosine threshold's specificity-proxy behaviour at the inter-CPA pair level and the structural-dimension threshold $\text{dHash} \leq 5$'s pair-level coincidence behaviour. The sub-band thresholds ($\text{dHash} = 15$, $\text{cos} = 0.837$) retain their supplementary calibration evidence; the present calibration does not provide independent rates for those sub-bands.
**Three units of analysis.** We report inter-CPA negative-anchor coincidence behaviour at three units, each addressing a different operational question:
@@ -304,7 +304,7 @@ We sample $5 \times 10^5$ inter-CPA pairs uniformly at random from Big-4 signatu
| Threshold | Per-comparison inter-CPA coincidence rate | 95% Wilson CI |
|---|---|---|
| Cosine $> 0.95$ | $0.00060$ | $[0.00053, 0.00067]$ |
| Cosine $> 0.945$ (prior published operating point) | $0.00081$ | $[0.00073, 0.00089]$ |
| Cosine $> 0.945$ (alternative operating point from supplementary calibration evidence) | $0.00081$ | $[0.00073, 0.00089]$ |
| Cosine $> 0.97$ | $0.00024$ | $[0.00020, 0.00029]$ |
| Cosine $> 0.98$ | $0.00009$ | $[0.00007, 0.00012]$ |
| dHash $\leq 5$ | $0.00129$ | $[0.00120, 0.00140]$ |
@@ -314,7 +314,7 @@ We sample $5 \times 10^5$ inter-CPA pairs uniformly at random from Big-4 signatu
| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair semantics) | $0.00014$ | $[0.00011, 0.00018]$ |
| Joint: cos $> 0.95$ AND dHash $\leq 4$ (any-pair) | $0.00011$ | $[0.00008, 0.00014]$ |
The cosine row at $\text{cos} > 0.95$ is consistent with a prior published per-comparison rate of $0.0005$ on a similarly-sized inter-CPA sample; the present $5 \times 10^5$-pair sample yields $0.00060$, within that earlier precision. The dHash row and joint row are reported here for the first time on this corpus; the prior calibration did not provide an inter-CPA pair-level coincidence rate for the structural dimension or the joint rule.
The cosine row at $\text{cos} > 0.95$ is consistent with the corpus-wide per-comparison rate of $0.0005$ reported in §IV-I on a similarly-sized inter-CPA sample; the present $5 \times 10^5$-pair sample yields $0.00060$, within that precision. The dHash row and joint row are reported here for the first time; the corpus-wide spike did not provide an inter-CPA pair-level coincidence rate for the structural dimension or the joint rule.
The all-firms-scope sample yields slightly lower per-comparison coincidence rates (cos $> 0.95$: $0.00031$; dHash $\leq 5$: $0.00073$; joint: $0.00007$); the all-firms sample weights small CPAs more heavily under CPA-uniform pair sampling, so we treat the Big-4 sample as the primary calibration scope and report all-firms as a corroborating-scope robustness check.
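
The Wilson CI column above can be reproduced from the implied hit counts; for
the cos $> 0.95$ row, $0.00060 \times 5 \times 10^5 = 300$ hits (a minimal
check, not the paper's script):

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

print(wilson_ci(300, 500_000))  # ~(0.000536, 0.000672): the [0.00053, 0.00067] row
```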
@@ -393,7 +393,7 @@ The per-decile per-firm breakdown (Script 44) confirms the pattern: within every
For the same-pair joint event (a single candidate satisfying both $\text{cos} > 0.95$ and $\text{dHash} \leq 5$), the candidate firm is even more strongly concentrated within the source firm: Firm A source $\to$ Firm A candidate in $11{,}314$ of $11{,}319$ same-pair hits ($99.96\%$); Firm B source $\to$ Firm B candidate in $85$ of $87$ ($97.7\%$); Firm C source $\to$ Firm C candidate in $54$ of $55$ ($98.2\%$); Firm D source $\to$ Firm D candidate in $64$ of $66$ ($97.0\%$).
**Interpretation.** Under the deployed any-pair rule, the within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D — Firm A's pattern is markedly more within-firm-concentrated than the other three firms', though every Big-4 firm still has more than three quarters of its any-pair collisions falling on candidates within the same firm. The stricter same-pair joint event — a single candidate satisfying both cos $> 0.95$ and dHash $\leq 5$ — saturates at $97.0$–$99.96\%$ within-firm across all four firms. This pattern is consistent with — but not by itself diagnostic of — firm-specific template, stamp, or document-production reuse: within-firm scanning workflows, common form templates, and shared report-generation infrastructure could produce visually similar signature crops across different CPAs within the same firm. Byte-level decomposition of Firm A's $145$ pixel-identical signatures across $\sim 50$ distinct certifying partners (Appendix B; §III-H.2) provides direct evidence of image-level reuse among Firm A signatures; the distribution across many partners is consistent with a firm-level template or production workflow, and the broader inter-CPA collision pattern in §III-L.4 is consistent with similar, milder production-related reuse patterns at Firms B/C/D. We report this as "inter-CPA collision concentration is within-firm" — a descriptive observation about deployed-rule behaviour — and refrain from inferring that the within-firm hits constitute deliberate or systematic template sharing.
**Interpretation.** Under the deployed any-pair rule, the within-firm collision concentration is $98.8\%$ at Firm A and $76.7$–$83.7\%$ at Firms B/C/D — Firm A's pattern is markedly more within-firm-concentrated than the other three firms', though every Big-4 firm still has more than three quarters of its any-pair collisions falling on candidates within the same firm. The stricter same-pair joint event — a single candidate satisfying both cos $> 0.95$ and dHash $\leq 5$ — saturates at $97.0$–$99.96\%$ within-firm across all four firms. This pattern is consistent with — but not by itself diagnostic of — firm-specific template, stamp, or document-production reuse: within-firm scanning workflows, common form templates, and shared report-generation infrastructure could produce visually similar signature crops across different CPAs within the same firm. Byte-level decomposition of Firm A's $145$ pixel-identical signatures across $\sim 50$ distinct certifying partners (supplementary materials; §III-H.2) provides direct evidence of image-level reuse among Firm A signatures; the distribution across many partners is consistent with a firm-level template or production workflow, and the broader inter-CPA collision pattern in §III-L.4 is consistent with similar, milder production-related reuse patterns at Firms B/C/D. We report this as "inter-CPA collision concentration is within-firm" — a descriptive observation about deployed-rule behaviour — and refrain from inferring that the within-firm hits constitute deliberate or systematic template sharing.
This connects back to §III-J: the K=3 firm-composition contrast at the accountant level (Firm A dominating C3; Firm C dominating C1) reappears at the deployment level in the cross-firm hit matrix, where the within-firm collision concentration is the dominant pattern at all four Big-4 firms — most strongly at Firm A ($98.8\%$ any-pair, $99.96\%$ same-pair) and at materially lower but still majority levels at Firms B/C/D ($76.7$–$83.7\%$ any-pair; $97.0$–$98.2\%$ same-pair).
@@ -418,7 +418,7 @@ The per-signature observed-deployed rate is $\sim 4.5\times$ the pool-normalised
We *do not* interpret the deployed-rate excess as a presumed true-positive rate; the inferential limits of this interpretation are developed in §III-M. The deployed-rate excess is best read as a *same-CPA repeatability signal* — a quantity that exceeds what random inter-CPA candidate replacement would produce — rather than as an estimate of true replication prevalence.
## M. Validation Strategy and Limitations under Unsupervised Setting
## M. Unsupervised Diagnostic Strategy and Limits
The corpus lacks signature-level ground-truth replication labels: no signature is annotated as definitively hand-signed or definitively templated. The conservative positive anchor (pixel-identical same-CPA signatures; §III-K.4) is by construction near $\text{cos} = 1$ and $\text{dHash} = 0$, providing a tautological capture-check rather than a sensitivity estimate for the non-byte-identical replicated class. The corpus therefore does not admit standard supervised classifier validation: we cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, or precision against ground truth.
@@ -441,7 +441,7 @@ Each diagnostic reported in this paper therefore addresses one specific failure
No single diagnostic provides ground-truth validation; together they define the limits of what can be supported in this corpus without signature-level ground truth.
**Limits of the present analysis.** We do not claim a validated forensic detector or an autonomous classification system. We do not report False Rejection Rate, sensitivity, recall, EER, ROC-AUC, precision, or positive predictive value against ground truth, because no ground truth exists at the signature level. We do not interpret the deployed-rate excess of §III-L.6 as a presumed true-positive rate: that interpretation would require assuming that the within-firm same-CPA pool's collision rate equals the inter-CPA proxy rate in the absence of replication (i.e., that genuine same-CPA hand-signing would produce a collision rate no higher than random inter-CPA pairs). Two factors make the assumption unsafe: (a) a CPA who signs consistently can produce stylistically similar signatures across years that exceed inter-CPA similarity at the cosine axis; (b) within-firm template sharing (§III-L.4 cross-firm hit matrix; byte-level evidence of Firm A's pixel-identical signatures across partners, Appendix B) places a substantial inter-CPA collision floor that itself reflects template-like reuse rather than independent inter-CPA random matching. We do not infer that the within-firm collision concentration of §III-L.4 constitutes deliberate template sharing; we describe it as "inter-CPA collision concentration is within-firm" and treat the mechanism as an open empirical question.
**Limits of the present analysis.** We do not claim a validated forensic detector or an autonomous classification system. We do not report False Rejection Rate, sensitivity, recall, EER, ROC-AUC, precision, or positive predictive value against ground truth, because no ground truth exists at the signature level. We do not interpret the deployed-rate excess of §III-L.6 as a presumed true-positive rate: that interpretation would require assuming that the within-firm same-CPA pool's collision rate equals the inter-CPA proxy rate in the absence of replication (i.e., that genuine same-CPA hand-signing would produce a collision rate no higher than random inter-CPA pairs). Two factors make the assumption unsafe: (a) a CPA who signs consistently can produce stylistically similar signatures across years that exceed inter-CPA similarity at the cosine axis; (b) within-firm template sharing (§III-L.4 cross-firm hit matrix; byte-level evidence of Firm A's pixel-identical signatures across partners, supplementary materials) places a substantial inter-CPA collision floor that itself reflects template-like reuse rather than independent inter-CPA random matching. We do not infer that the within-firm collision concentration of §III-L.4 constitutes deliberate template sharing; we describe it as "inter-CPA collision concentration is within-firm" and treat the mechanism as an open empirical question.
**Scope of the present analysis.** The deployed signature-replication screening rule is characterised at three units of analysis (per-comparison, per-signature pool, per-document) against an inter-CPA negative-anchor coincidence-rate calibration. The per-comparison rates ($\leq 0.0006$ at cos $> 0.95$; $\leq 0.0013$ at dHash $\leq 5$; $\leq 0.00014$ jointly) are specificity-proxy-anchored operating points consistent with biometric-verification convention, with the proxy nature recorded in §III-L.0 and §III-M. The per-signature and per-document rates ($0.11$ and $0.34$ respectively under the deployed any-pair HC + MC alarm) are operationally meaningful **alarm-yield** indicators rather than true error rates. Per-firm rates show substantial heterogeneity (Firm A's per-document HC + MC alarm at $0.62$ vs Firm B/C/D at $0.09$–$0.16$), driven by firm-level rather than pool-size effects, and concentrated in within-firm cross-CPA candidate matching. The framework is positioned as a **specificity-proxy-anchored screening tool with human-in-the-loop review**, not as a validated forensic classifier.