Paper A v3.18.2: address codex GPT-5.5 round-16 Minor-Revision findings

Codex independent peer review (paper/codex_review_gpt55_v3_18_1.md) audited
empirical claims against scripts/JSON reports rather than rubber-stamping
prior Accept verdicts. Verdict: Minor Revision. This commit addresses every
flagged item.

- Soften mechanism-identification language (Results IV-D.1, Discussion B):
  per-signature cosine "fails to reject unimodality" rather than "reflects a
  single dominant generative mechanism"; framing tied to joint evidence.
- Replace overabsolute "single stored image" with multi-template phrasing
  in Introduction and Methodology III-A.
- Reframe Methodology III-H so practitioner knowledge is non-load-bearing;
  evidentiary basis is the paper's own image evidence.
- Fix stale section cross-references after the v3.18 retitling: IV-F.* ->
  IV-G.* in 11 locations across methodology and results.
- Fix 0.941 / 0.945 / 0.9407 wording in Methodology III-K to use the
  calibration-fold P5 = 0.9407 and the rounded sensitivity cut 0.945.
- Soften "sharp discontinuity" in Results IV-G.3 to "23-28 percentage-point
  gap consistent with firm-wide non-hand-signing practice".
- Soften Conclusion's "directly generalizable" with explicit conditions on
  analogous anchors and artifact-generation physics.
- Add Appendix B: table-to-script provenance map (15 manuscript tables
  mapped to generating scripts and JSON report artifacts).
- New script signature_analysis/28_byte_identity_decomposition.py produces
  reproducible artifacts for two previously-unverified claims:
  (a) 145 / 50 / 180 / 35 Firm A byte-identity decomposition (verified);
  (b) cross-firm dual-descriptor convergence -- corrected from the previous
      manuscript text "non-Firm-A 11.3% vs Firm A 58.7% (5x)" to the
      database-verified "non-Firm-A 42.12% vs Firm A 88.32% (~2.1x)".
- Clarify scripts 19 / 21 docstrings: legacy EER / FRR / Precision / F1
  helpers are retained for diagnostic use only and are NOT cited as
  biometric performance in the paper. Remove "interview evidence" wording.
- Rebuild Paper_A_IEEE_Access_Draft_v3.docx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-27 20:23:08 +08:00
parent cb77f481ec
commit 4bb7aa9189
9 changed files with 299 additions and 53 deletions
Binary file not shown.
+27
View File
@@ -33,3 +33,30 @@ Taken together, Table A.I shows that the signature-level BD/McCrary transitions
This observation supports the main-text decision to use BD/McCrary as a density-smoothness diagnostic rather than as a threshold estimator and reinforces the joint reading of Section IV-D that per-signature similarity does not form a clean two-mechanism mixture.
Raw per-bin $Z$ sequences and $p$-values for every (variant, bin-width) panel are available in the supplementary materials.
# Appendix B. Table-to-Script Provenance
For reproducibility, the following table maps each numerical table in Section IV to the analysis script that produces its underlying values and to the JSON / Markdown report file emitted by that script. Scripts referenced are under `signature_analysis/` and reports under the project's `reports/` tree.
<!-- TABLE B.I: Manuscript table → reproduction artifact
| Manuscript table | Generating script | Report artifact |
|------------------|-------------------|-----------------|
| Table III (extraction results) | `02_extract_features.py`; `09_pdf_signature_verdict.py` | extraction logs (supplementary) |
| Table IV (intra/inter all-pairs cosine statistics) | `10_formal_statistical_analysis.py` | `reports/formal_statistical/formal_statistical_results.json` |
| Table V (Hartigan dip test) | `15_hartigan_dip_test.py` | `reports/dip_test/dip_test_results.json` |
| Table VI (signature-level threshold-estimator summary) | `17_beta_mixture_em.py`; `25_bd_mccrary_sensitivity.py` | `reports/beta_mixture/beta_mixture_results.json`; `reports/bd_sensitivity/bd_sensitivity.json` |
| Table IX (Firm A whole-sample capture rates) | `19_pixel_identity_validation.py`; `24_validation_recalibration.py` | `reports/pixel_validation/pixel_validation_results.json`; `reports/validation_recalibration/validation_recalibration.json` |
| Table X (cosine threshold sweep, FAR vs inter-CPA negatives) | `21_expanded_validation.py` | `reports/expanded_validation/expanded_validation_results.json` |
| Table XI (held-out vs calibration Firm A capture rates) | `24_validation_recalibration.py` | `reports/validation_recalibration/validation_recalibration.json` |
| Table XII (operational-cut sensitivity 0.95 vs 0.945) | `24_validation_recalibration.py` | `reports/validation_recalibration/validation_recalibration.json` |
| Table XIII (Firm A per-year cosine distribution) | `13_deloitte_distribution_analysis.py` | `reports/deloitte_distribution/deloitte_distribution_results.json` |
| Tables XIV / XV (partner-level similarity ranking) | `22_partner_ranking.py` | `reports/partner_ranking/partner_ranking_results.json` |
| Table XVI (intra-report classification agreement) | `23_intra_report_consistency.py` | `reports/intra_report/intra_report_results.json` |
| Table XVII (document-level five-way classification) | `09_pdf_signature_verdict.py`; `12_generate_pdf_level_report.py` | `reports/pdf_level/pdf_level_results.json` |
| Table XVIII (backbone ablation) | `paper/ablation_backbone_comparison.py` | `reports/ablation/ablation_results.json` |
| Table A.I (BD/McCrary bin-width sensitivity) | `25_bd_mccrary_sensitivity.py` | `reports/bd_sensitivity/bd_sensitivity.json` |
| Byte-identity decomposition (145 / 50 / 180 / 35; Section IV-F.1) | `28_byte_identity_decomposition.py` | `reports/byte_identity_decomp/byte_identity_decomposition.json` |
| Cross-firm dual-descriptor convergence (Section IV-H.2) | `28_byte_identity_decomposition.py` | `reports/byte_identity_decomp/byte_identity_decomposition.json` |
-->
The table-to-script mapping above is intended as a navigation aid for replicators. All scripts run deterministically under the fixed random seeds documented in the supplementary materials; report files are committed alongside the scripts so that each numerical claim in Section IV traces to a specific JSON field rather than to an undocumented intermediate computation.
+1 -1
View File
@@ -27,5 +27,5 @@ Several directions merit further investigation.
Domain-adapted feature extractors, trained or fine-tuned on signature-specific datasets, may improve discriminative performance beyond the transferred ImageNet features used in this study.
Extending the analysis to auditor-year units---computing per-signature statistics within each fiscal year and tracking how individual CPAs move across years---could reveal within-CPA transitions between hand-signing and non-hand-signing over the decade and is the natural next step beyond the cross-sectional analysis reported here.
The pipeline's applicability to other jurisdictions and document types (e.g., corporate filings in other countries, legal documents, medical records) warrants exploration.
The replication-dominated calibration strategy and the pixel-identity anchor technique are both directly generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself.
The replication-dominated calibration strategy and the pixel-identity anchor technique are both generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself, conditional on the availability of analogous anchors in the new domain and on artifact-generation physics that preserve the byte-level trace.
Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach.
+1 -1
View File
@@ -9,7 +9,7 @@ While the law permits either a handwritten signature or a seal, the CPA's attest
The digitization of financial reporting has introduced a practice that complicates this intent.
As audit reports are now routinely generated, transmitted, and archived as PDF documents, it is technically and operationally straightforward to reproduce a CPA's stored signature image across many reports rather than re-executing the signing act for each one.
This reproduction can occur either through an administrative stamping workflow---in which scanned signature images are affixed by staff as part of the report-assembly process---or through a firm-level electronic signing system that automates the same step.
From the perspective of the output image the two workflows are equivalent: both yield a pixel-level reproduction of a single stored image on every report the partner signs off, so that signatures on different reports of the same partner are identical up to reproduction noise.
From the perspective of the output image the two workflows are equivalent: both can reproduce one or more stored signature images, producing same-CPA signatures that are identical or near-identical up to reproduction, scanning, compression, and template-variant noise.
We refer to signatures produced by either workflow collectively as *non-hand-signed*.
Although this practice may fall within the literal statutory requirement of "signature or seal," it raises substantive concerns about audit quality, as an identically reproduced signature applied across hundreds of reports may not represent meaningful individual attestation for each engagement.
The accounting literature has long examined the audit-quality consequences of partner-level engagement transparency: studies of partner-signature mandates in the United Kingdom find measurable downstream effects [31], cross-jurisdictional evidence on individual partner signature requirements highlights similar quality channels [32], and Taiwan-specific evidence on mandatory partner rotation documents how individual-partner identification interacts with audit-quality outcomes [33].
+12 -12
View File
@@ -7,7 +7,7 @@ Fig. 1 illustrates the overall architecture.
The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum anchored on whole-sample Firm A percentile heuristics and validated against a byte-level pixel-identity positive anchor and a large random inter-CPA negative anchor.
Throughout this paper we use the term *non-hand-signed* rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years).
From the perspective of the output image the two workflows are equivalent: both reproduce a single stored image so that signatures on different reports from the same partner are identical up to reproduction noise.
From the perspective of the output image the two workflows are equivalent: both can reproduce one or more stored signature images, producing same-CPA signatures that are identical or near-identical up to reproduction, scanning, compression, and template-variant noise.
<!--
[Figure 1: Pipeline Architecture - clean vector diagram]
@@ -116,7 +116,7 @@ Cosine similarity and dHash are both robust to the noise introduced by the print
Two unit-of-analysis choices are relevant for this study, ordered from finest to coarsest: (i) the *signature*---one signature image extracted from one report; and (ii) the *auditor-year*---all signatures by one CPA within one fiscal year.
The signature is the operational unit of classification (Section III-K) and of all primary statistical analyses (Section IV-D, IV-F, IV-G).
The auditor-year is used in the partner-level similarity ranking of Section IV-F.2 as a deliberately within-year aggregation that avoids cross-year pooling.
The auditor-year is used in the partner-level similarity ranking of Section IV-G.2 as a deliberately within-year aggregation that avoids cross-year pooling.
We do not use a coarser CPA-level cross-year unit, because pooling a CPA's signatures across the full 2013--2023 sample period would conflate distinct signing-mechanism regimes whenever a CPA's practice changes during the sample, and we make no claim about the within-CPA stability of signing mechanisms over time.
For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA (over the full same-CPA set, not restricted to the same fiscal year).
@@ -136,29 +136,29 @@ We make *no* within-year or across-year uniformity assumption about CPA signing
Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation.
A CPA's signing output within a single fiscal year may reflect a single replication template, multiple templates used in parallel (e.g., different stored images for different engagement positions or reporting pipelines), within-year mechanism mixing, or a combination; our signature-level analyses remain valid under all of these regimes, since they do not attempt mechanism attribution at the partner or report level.
The intra-report consistency analysis in Section IV-F.3 is a firm-level homogeneity check---whether the *two co-signing CPAs on the same report* receive the same signature-level label under the operational classifier---rather than a test of within-partner or within-year uniformity.
The intra-report consistency analysis in Section IV-G.3 is a firm-level homogeneity check---whether the *two co-signing CPAs on the same report* receive the same signature-level label under the operational classifier---rather than a test of within-partner or within-year uniformity.
## H. Calibration Reference: Firm A as a Replication-Dominated Population
A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm in Taiwan---as an empirical calibration reference.
Rather than treating Firm A as a synthetic or laboratory positive control, we treat it as a naturally occurring *replication-dominated population*: a CPA population whose aggregate signing behavior is dominated by non-hand-signing but is not a pure positive class.
The background context for this choice is practitioner knowledge about Firm A's signing practice: industry practice at the firm is widely understood among practitioners to involve reproducing a stored signature image for the majority of certifying partners---originally via administrative stamping workflows and later via firm-level electronic signing systems---while not ruling out that a minority of partners may continue to hand-sign some or all of their reports.
We use this only as background context for why Firm A is a plausible calibration candidate; the *evidence* for Firm A's replication-dominated status comes entirely from the paper's own analyses, which do not depend on any claim about signing practice beyond what the audit-report images themselves show.
Practitioner knowledge motivated treating Firm A as a candidate calibration reference: it is widely held within the audit profession that the firm reproduces a stored signature image for the majority of certifying partners---originally via administrative stamping workflows and later via firm-level electronic signing systems---while not ruling out that a minority of partners may continue to hand-sign some or all of their reports.
This practitioner background is *non-load-bearing* in our analysis: the evidentiary basis used in this paper is the observable image evidence reported below---byte-identical same-CPA pairs, the Firm A per-signature similarity distribution, partner-ranking concentration, and intra-report consistency---which does not depend on any claim about signing practice beyond what the audit-report images themselves show.
We establish Firm A's replication-dominated status through two primary independent quantitative analyses plus a third strand comprising three complementary checks, each of which can be reproduced from the public audit-report corpus alone:
First, *automated byte-level pair analysis* (Section IV-F.1) identifies 145 Firm A signatures that are byte-identical to at least one other same-CPA signature from a different audit report, distributed across 50 distinct Firm A partners (of 180 registered); 35 of these byte-identical matches span different fiscal years.
First, *automated byte-level pair analysis* (Section IV-F.1; reproduced by `signature_analysis/28_byte_identity_decomposition.py` with output in `reports/byte_identity_decomp/byte_identity_decomposition.json`) identifies 145 Firm A signatures that are byte-identical to at least one other same-CPA signature from a different audit report, distributed across 50 distinct Firm A partners (of 180 registered); 35 of these byte-identical matches span different fiscal years.
Byte-identity implies pixel-identity by construction, and independent hand-signing cannot produce pixel-identical images across distinct reports---these pairs therefore establish image reuse as a concrete, threshold-free phenomenon within Firm A and confirm that replication is widespread (50 of 180 registered partners) rather than confined to a handful of CPAs.
Second, *signature-level distributional evidence*: Firm A's per-signature best-match cosine distribution is unimodal with a long left tail (Hartigan dip test $p = 0.17$ at $n \geq 10$ signatures; Section IV-D), consistent with a single dominant mechanism (non-hand-signing) plus residual within-firm heterogeneity rather than two cleanly separated mechanisms.
92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95 and the remaining 7.5% form the long left tail (we do not disaggregate partner-level mechanism here; see Section III-G for the scope of claims).
The unimodal-long-tail shape, not the precise 92.5/7.5 split, is the structural evidence: it predicts that Firm A is replication-dominated rather than a clean two-class population, and a noise-only explanation of the left tail would predict a shrinking share as scan/PDF technology matured over 2013--2023, which is not what we observe (Section IV-F.1).
The unimodal-long-tail shape, not the precise 92.5/7.5 split, is the structural evidence: it predicts that Firm A is replication-dominated rather than a clean two-class population, and a noise-only explanation of the left tail would predict a shrinking share as scan/PDF technology matured over 2013--2023, which is not what we observe (Section IV-G.1).
Third, we additionally validate the Firm A benchmark through three complementary analyses reported in Section IV-F. Only the partner-level ranking is fully threshold-free; the longitudinal-stability and intra-report analyses use the operational classifier and are interpreted as consistency checks on its firm-level output:
(a) *Longitudinal stability (Section IV-F.1).* The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6-13% across 2013-2023, with the lowest share in 2023. The 0.95 cutoff is the whole-sample Firm A P7.5 heuristic (Section III-K; 92.5% of whole-sample Firm A signatures exceed this cutoff); the substantive finding here is the *temporal stability* of the rate, not the absolute rate at any single year.
(b) *Partner-level similarity ranking (Section IV-F.2).* When every auditor-year is ranked globally by its per-auditor-year mean best-match cosine (across all firms: Big-4 and Non-Big-4), Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff.
(c) *Intra-report consistency (Section IV-F.3).* Because each Taiwanese statutory audit report is co-signed by two engagement partners, firm-wide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label under the classifier. Firm A exhibits 89.9% intra-report agreement against 62-67% at the other Big-4 firms. This test uses the operational classifier and is therefore a *consistency* check on the classifier's firm-level output rather than a threshold-free test; the cross-firm gap (not the absolute rate) is the substantive finding.
Third, we additionally validate the Firm A benchmark through three complementary analyses reported in Section IV-G. Only the partner-level ranking is fully threshold-free; the longitudinal-stability and intra-report analyses use the operational classifier and are interpreted as consistency checks on its firm-level output:
(a) *Longitudinal stability (Section IV-G.1).* The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6-13% across 2013-2023, with the lowest share in 2023. The 0.95 cutoff is the whole-sample Firm A P7.5 heuristic (Section III-K; 92.5% of whole-sample Firm A signatures exceed this cutoff); the substantive finding here is the *temporal stability* of the rate, not the absolute rate at any single year.
(b) *Partner-level similarity ranking (Section IV-G.2).* When every auditor-year is ranked globally by its per-auditor-year mean best-match cosine (across all firms: Big-4 and Non-Big-4), Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff.
(c) *Intra-report consistency (Section IV-G.3).* Because each Taiwanese statutory audit report is co-signed by two engagement partners, firm-wide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label under the classifier. Firm A exhibits 89.9% intra-report agreement against 62-67% at the other Big-4 firms. This test uses the operational classifier and is therefore a *consistency* check on the classifier's firm-level output rather than a threshold-free test; the cross-firm gap (not the absolute rate) is the substantive finding.
We emphasize that the 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the byte-level pixel-identity evidence, the unimodal-long-tail dip-test result, the three complementary analyses above, and the held-out Firm A fold (described in Section III-J; fold-level rate differences are disclosed in Section IV-F.2).
@@ -280,7 +280,7 @@ High feature-level similarity without structural corroboration---consistent with
We note three conventions about the thresholds.
First, the cosine cutoff $0.95$ corresponds to approximately the whole-sample Firm A P7.5 of the per-signature best-match cosine distribution---that is, 92.5% of whole-sample Firm A signatures exceed this cutoff and 7.5% fall at or below it (Section III-H)---chosen as a round-number lower-tail boundary whose complement (92.5% above) has a transparent interpretation in the whole-sample reference distribution; the cosine crossover $0.837$ is the all-pairs intra/inter KDE crossover; both are derived from whole-sample distributions rather than from the 70% calibration fold, so the classifier inherits its operational cosine cuts from the whole-sample Firm A and all-pairs distributions.
Section IV-F.3 reports a sensitivity check confirming that replacing $0.95$ with the slightly stricter Firm A P5 percentile $0.941$ alters aggregate firm-level capture rates by at most $\approx 1.2$ percentage points, so the round-number heuristic is robust to nearby percentile-based alternatives.
Section IV-F.3 reports a sensitivity check confirming that replacing $0.95$ with the nearby rounded sensitivity cut $0.945$ (motivated by the calibration-fold P5 = 0.9407, see Section IV-F.2) shifts whole-Firm-A dual-rule capture by 1.19 percentage points, so the round-number heuristic is robust to nearby percentile-based alternatives.
Section IV-F.2 reports both calibration-fold and held-out-fold capture rates for this classifier so that fold-level sampling variance is visible.
Second, the dHash cutoffs $\leq 5$ and $> 15$ are chosen from the whole-sample Firm A $\text{dHash}_\text{indep}$ distribution: $\leq 5$ captures the upper tail of the high-similarity mode (whole-sample Firm A median $\text{dHash}_\text{indep} = 2$, P75 $\approx 4$, so $\leq 5$ is the band immediately above median), while $> 15$ marks the regime in which independent-minimum structural similarity is no longer indicative of image reproduction.
Third, the signature-level threshold-estimator outputs of Section IV-D (KDE antimode, Beta-mixture and logit-Gaussian crossings, BD/McCrary diagnostic) are *not* the operational thresholds of this classifier: they are descriptive characterisation of the per-signature similarity distribution, and Section IV-D shows they do not converge to a clean two-mechanism boundary at the per-signature level---which is why the operational cosine cut is anchored on the whole-sample Firm A percentile rather than on any mixture-fit crossing.
+8 -7
View File
@@ -73,9 +73,9 @@ The $N = 168{,}740$ count used in Table V and in the downstream same-CPA per-sig
| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal |
-->
Firm A's per-signature cosine distribution is *unimodal* ($p = 0.17$), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to within-firm heterogeneity in signing outputs (Section III-G discusses the scope of partner-level claims).
Firm A's per-signature cosine distribution *fails to reject unimodality* ($p = 0.17$), a pattern consistent with a dominant high-similarity regime plus a long left tail attributable to within-firm heterogeneity in signing outputs (Section III-G discusses the scope of partner-level claims).
The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is *multimodal* ($p < 0.001$).
The Firm A unimodal-long-tail finding is the structural evidence that supports the replication-dominated framing (Section III-H): a single dominant mechanism plus residual within-firm heterogeneity, rather than two cleanly separated mechanisms.
The Firm A unimodal-long-tail finding is, in conjunction with the byte-identity, partner-ranking, and intra-report evidence reported below, consistent with the replication-dominated framing (Section III-H): a dominant high-similarity regime plus residual within-firm heterogeneity, rather than two cleanly separated mechanisms.
### 2) Burgstahler-Dichev / McCrary Density-Smoothness Diagnostic
@@ -204,7 +204,7 @@ Under this proper test the two extreme rules agree across folds (cosine $> 0.837
The operationally relevant rules in the 8595% capture band differ between folds by 15 percentage points ($p < 0.001$ given the $n \approx 45\text{k}/15\text{k}$ fold sizes).
Both folds nevertheless sit in the same replication-dominated regime: every calibration-fold rate in the 8599% range has a held-out counterpart in the 8799% range, and the operational dual rule cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.40% of the calibration fold and 91.54% of the held-out fold.
The modest fold gap is consistent with within-Firm-A heterogeneity in replication intensity: the random 30% CPA sample evidently contained proportionally more high-replication CPAs.
We therefore interpret the held-out fold as confirming the qualitative finding (Firm A is strongly replication-dominated across both folds) while cautioning that exact rates carry fold-level sampling noise that a single 30% split cannot eliminate; the threshold-independent partner-ranking analysis (Section IV-F.2) is the cross-check that is robust to this fold variance.
We therefore interpret the held-out fold as confirming the qualitative finding (Firm A is strongly replication-dominated across both folds) while cautioning that exact rates carry fold-level sampling noise that a single 30% split cannot eliminate; the threshold-independent partner-ranking analysis (Section IV-G.2) is the cross-check that is robust to this fold variance.
### 3) Operational-Threshold Sensitivity: cos $> 0.95$ vs cos $> 0.945$
@@ -332,7 +332,7 @@ A report is "in agreement" if both signature labels fall in the same coarse buck
Firm A achieves 89.9% intra-report agreement, with 87.5% of Firm A reports having *both* signers classified as non-hand-signed and only 4 reports (0.01%) having both classified as likely hand-signed.
The other Big-4 firms (B, C, D) and non-Big-4 firms cluster at 62-67% agreement, a 23-28 percentage-point gap.
This sharp discontinuity in intra-report agreement between Firm A and the other firms is the pattern predicted by firm-wide (rather than partner-specific) non-hand-signing practice.
This 23-28 percentage-point gap in intra-report agreement between Firm A and the other firms is consistent with firm-wide (rather than partner-specific) non-hand-signing practice; we do not claim a sharp discontinuity in the formal sense, since classifier calibration, firm-specific document-production pipelines, and signer-mix differences could each contribute to gap magnitude.
We note that this test uses the calibrated classifier of Section III-K rather than a threshold-free statistic; the substantive evidence lies in the *cross-firm gap* between Firm A and the other firms rather than in the absolute agreement rate at any single firm, and that gap is robust to moderate shifts in the absolute cutoff so long as the cutoff is applied uniformly across firms.
@@ -341,7 +341,7 @@ We note that this test uses the calibrated classifier of Section III-K rather th
Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents.
The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents carry only a single detected signature, for which no same-CPA pairwise comparison and therefore no best-match cosine / min dHash statistic is available; those documents are excluded from the classification reported here.
We emphasize that the document-level proportions below reflect the *worst-case aggregation rule* of Section III-K: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts.
Document-level rates therefore represent the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-F.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.
Document-level rates therefore represent the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-G.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.
<!-- TABLE XVII: Document-Level Classification (Dual-Descriptor: Cosine + dHash)
| Verdict | N (PDFs) | % | Firm A | Firm A % |
@@ -370,8 +370,9 @@ We note that because the non-hand-signed thresholds are themselves calibrated to
### 2) Cross-Firm Comparison of Dual-Descriptor Convergence
Among non-Firm-A CPAs with cosine $> 0.95$, only 11.3% exhibit dHash $\leq 5$, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer.
This cross-firm gap is consistent with firm-wide non-hand-signing practice at Firm A versus partner-specific or per-engagement replication at other firms; it complements the partner-level ranking (Section IV-F.2) and intra-report consistency (Section IV-F.3) findings.
Among the 65,515 non-Firm-A signatures with per-signature best-match cosine $> 0.95$, 42.12% have $\text{dHash}_\text{indep} \leq 5$, compared to 88.32% of the 55,921 Firm A signatures meeting the same cosine condition---a $\sim 2.1\times$ difference that the structural-verification layer makes visible.
This cross-firm gap is consistent with firm-wide non-hand-signing practice at Firm A versus partner-specific or per-engagement replication at other firms; it complements the partner-level ranking (Section IV-G.2) and intra-report consistency (Section IV-G.3) findings.
Counts and percentages are reproduced by `signature_analysis/28_byte_identity_decomposition.py` and reported in `reports/byte_identity_decomp/byte_identity_decomposition.json` (see Appendix B for the table-to-script provenance map).
## I. Ablation Study: Feature Backbone Comparison