Paper A v3.14: remove A2 assumption + soften all partner-level claims
The within-auditor-year uniformity assumption (A2), introduced in v3.11
(Methodology Section III-G), was empirically tested via a new
within-year uniformity check
(signature_analysis/27_within_year_uniformity.py; output in
reports/within_year_uniformity/). The check found that within-year
pairwise cosine distributions, even at the calibration firm, show
substantial heterogeneity inconsistent with strict single-mechanism
uniformity: Firm A 2023 CPAs typically have a median pairwise cosine
around 0.85, with 20-70% of pairs below the all-pairs KDE crossover of
0.837. A2 as stated ("a CPA who replicates any signature image in
that year is treated as doing so for every report") is therefore
empirically falsified.
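The check's core computation can be sketched as follows. This is an illustrative sketch, not the actual script 27 code (function names are ours); it assumes each signature's feature vector is already L2-normalized, so cosine similarity reduces to a dot product, and uses the 0.837 all-pairs KDE crossover quoted above.

```python
# Illustrative sketch of the per-CPA-year uniformity statistics; not the
# actual 27_within_year_uniformity.py implementation. Vectors are assumed
# L2-normalized, so cosine similarity is a plain dot product.
from itertools import combinations
from statistics import median

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))  # valid only for unit-norm vectors

def within_year_stats(vecs, crossover=0.837):
    """Median pairwise cosine and share of pairs below the all-pairs KDE
    crossover, for one CPA's signature vectors in one fiscal year."""
    pair_cos = [cosine(u, v) for u, v in combinations(vecs, 2)]
    share_below = sum(c < crossover for c in pair_cos) / len(pair_cos)
    return median(pair_cos), share_below
```

Under strict single-mechanism uniformity, replication would push the median toward 1.0 with a near-zero share below the crossover; the heterogeneity reported above instead shows medians around 0.85 with 20-70% of pairs below 0.837.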
Three explanations are compatible with the data and cannot be
disambiguated without manual inspection: (i) true within-year
mechanism mixing, (ii) multi-template replication workflows at the
same firm within a year, (iii) feature-extraction noise on repeatedly
scanned stamped images. Since A2 is falsified and its implications
cannot be restored under any of the three explanations, we remove
A2 entirely rather than downgrading it to an "approximation" or
"interpretive convention."
Changes applied:
1. Methodology Section III-G: A2 block deleted. Section now has only
A1 (pair-detectability, cross-year pair-existence). Replaced A2
with an explicit statement that we make no within-year or
across-year uniformity assumption, that per-signature labels are
signature-level quantities throughout, and that we abstain from
partner-level frequency inferences. Three candidate explanations
for within-year signature heterogeneity are listed (single-template
replication, multi-template replication in parallel, within-year
mixing, or combinations) without attempting disaggregation.
2. Methodology III-H strand 2 (L154) softened: "7.5% form a long left
tail consistent with a minority of hand-signers" rewritten as
reflecting "within-firm heterogeneity in signing output (we do not
disaggregate partner-level mechanism here; see Section III-G)."
3. Methodology III-H visual-inspection strand (L152) and the
corresponding Discussion V-C first strand (L41) and Conclusion L21
softened: "for the majority of partners" changed to "for many of
the sampled partners" (Codex round-14 MAJOR: "majority of partners"
is itself a partner-level frequency claim under the new scope-of-
claims regime).
4. Methodology III-K.3 Firm A anchor (L247): dropped "(consistent
with a minority of hand-signers)" parenthetical.
5. Results IV-D cosine distribution narrative (L72): softened to
"within-firm heterogeneity in signing outputs (see Section IV-E
and Section III-G for the scope of partner-level claims)."
6. Results IV-E cluster split framing (L128): "minority-hand-signers
framing of Section III-H" renamed to "within-firm heterogeneity
framing of Section III-H" (matches the new III-H text).
7. Results IV-H.1 partner-level reading (L286): removed entirely.
The v3.13 text "Under the within-year label-uniformity convention
A2, this left-tail share is read as a partner-level minority of
hand-signing CPAs" is replaced by a signature-level statement
that explicitly lists hand-signing partners, multi-template
replication, or a combination as possibilities without attempting
attribution.
8. Results IV-H.1 stability argument (L308): softened from "persistent
minority of hand-signing Firm A partners" to "persistent within-
firm heterogeneity component," preserving the substantive argument
that stability across production technologies is inconsistent with
a noise-only explanation.
9. Results IV-I Firm A Capture Profile (L407): rewrote the "Firm A's
minority hand-signers have not been captured" phrasing as a
signature-level framing about the 7.5% left tail not projecting
into the lowest-cosine document-level category under the dual-
descriptor rules.
10. Abstract (L5): softened "alongside within-firm heterogeneity
consistent with a minority of hand-signers" to "alongside residual
within-firm heterogeneity." Abstract at 244/250 words.
11. Discussion V-C third strand (L43): added "multi-template
replication workflows" to the list of possibilities and added
a local "we do not disaggregate these mechanisms; see Section
III-G for the scope of claims" disclaimer (Codex round-14 MINOR 5).
12. Discussion Limitations: added an Eighth limitation explicitly
stating that partner-level frequency inferences are not made and
why (no within-year uniformity assumption is adopted).
13. Methodology L124 opening: "We make one stipulation about within-
auditor-year structure" fixed to "same-CPA pair detectability,"
since A1 is a cross-year pair-existence property, not a within-
year claim (Codex round-14 MINOR 3).
14. Two broken cross-references fixed (Codex round-14 MINOR 6):
methodology L86 Section V-D -> V-G (Limitations is V-G; V-D is the
Style-Replication Gap); methodology L167 Section III-I -> Section
IV-D (the empirical cosine distribution is in IV-D, not III-I).
Script 27 and its output (reports/within_year_uniformity/*) remain
in the repository as internal due-diligence evidence but are not
cited from the paper. The paper's substantive claims at signature-
level and accountant (cross-year pooled) level are unchanged; only
the partner-level interpretive overlay is removed. All tables
(IV-XVIII), Appendix A (BD/McCrary sensitivity), and all reported
numbers are unchanged.
Codex round-14 (gpt-5.5 xhigh) verification: Major Revision caused
by one BLOCKER (stale DOCX artifact, not part of this commit) plus
one MAJOR ("majority of partners" partner-frequency claim) plus
four MINOR findings. All five markdown findings addressed in this
commit. DOCX regeneration deferred to pre-submission packaging.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -83,7 +83,7 @@ The final classification layer was removed, yielding the 2048-dimensional output
 Preprocessing consisted of resizing to 224×224 pixels with aspect-ratio preservation and white padding, followed by ImageNet channel normalization.
 All feature vectors were L2-normalized, ensuring that cosine similarity equals the dot product.

-The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-D).
+The choice of ResNet-50 without fine-tuning was motivated by three considerations: (1) the task is similarity comparison rather than classification, making general-purpose discriminative features sufficient; (2) ImageNet features have been shown to transfer effectively to document analysis tasks [20], [21]; and (3) avoiding domain-specific fine-tuning reduces the risk of overfitting to dataset-specific artifacts, though we note that a fine-tuned model could potentially improve discriminative performance (see Section V-G).
 This design choice is validated by an ablation study (Section IV-J) comparing ResNet-50 against VGG-16 and EfficientNet-B0.

 ## F. Dual-Method Similarity Descriptors
@@ -121,24 +121,18 @@ For per-signature classification we compute, for each signature, the maximum pai
 The max/min (rather than mean) formulation reflects the identification logic for non-hand-signing: if even one other signature of the same CPA is a pixel-level reproduction, that pair will dominate the extremes and reveal the non-hand-signed mechanism.
 Mean statistics would dilute this signal.

-We distinguish two stipulations by the role each plays, in order to avoid overstating the paper's reliance on them.
+We make one stipulation about same-CPA pair detectability.

 **(A1) Pair-detectability** is a statistical assumption scoped to the same-CPA pool (pooled across fiscal years, matching the max/min computation above): if a CPA uses image replication anywhere in the corpus, at least one pair of same-CPA signatures is near-identical after reproduction noise, so that max cosine / min dHash detects the replication.
 This is plausible for high-volume stamping or firm-level electronic-signing workflows---where a stored image is typically reused many times under similar scan and compression conditions---but is not guaranteed in sparse CPA-corpora with only one observed replicated report, when multiple template variants are in use, or when scan-stage noise pushes a replicated pair outside the detection regime.
 A1 is what the per-signature detector requires to be sensitive to replication; it is a cross-year pair-existence property, not a within-year uniformity claim.

-**(A2) Within-year label uniformity** is an interpretive convention used when a signature-level label is *read as* "this CPA's signing mechanism for that fiscal year": within any single fiscal year we treat the CPA's mechanism as uniform, i.e., a CPA who replicates any signature image in that year is treated as doing so for every report in that year, and a hand-signer is treated as hand-signing every report in that year.
-A2 is consistent with industry practice at Firm A during the sample period, but may weaken at other Big-4 firms during the 2019--2021 digitalization-transition years, in which a CPA's mechanism could in principle shift mid-year as firm-level electronic-signing systems were rolled out.
-We therefore read A2 as a domain-motivated default rather than a universally validated empirical claim.
-The arithmetic statistics reported in this paper do not require A2 for their definition or computation: the per-signature classifier (Section III-L) operates at signature level, the accountant-level mixture (Section III-J) uses mean statistics over the full same-CPA pool, and the partner-level ranking (Section IV-H.2) uses a per-auditor-year mean---none of which require within-year uniformity to be well-defined.
-A2 does, however, underwrite certain *interpretive* readings---most notably, the framing in Section IV-H.1 of Firm A's yearly left-tail share as a partner-level "minority of hand-signers" rather than a bare signature-level rate---and the downstream use of per-signature or per-auditor-year labels as regime labels for auditor-behavior research.
+We make *no* within-year or across-year uniformity assumption about CPA signing mechanisms.
+Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation.
+A CPA's signing output within a single fiscal year may reflect a single replication template, multiple templates used in parallel (e.g., different stored images for different engagement positions or reporting pipelines), within-year mechanism mixing, or a combination; our signature-level analyses remain valid under all of these regimes, since they do not attempt mechanism attribution at the partner or report level.
+The accountant-level summary statistics of Section III-J are likewise cross-year pooled quantities by construction, and may blend distinct signing-mechanism regimes when a CPA's practice changes over the sample period; we treat this as a design choice, not an identification assumption, and the accountant-level aggregates are to be read as characterizing each CPA's pooled observed tendency over the full sample period rather than a single time-invariant regime.

-We explicitly *do not* assume across-year homogeneity.
-A CPA's mechanism may change across fiscal years---the 2019--2021 Big-4 digitalization trends documented in Section IV-H are consistent with such changes---and accountant-level summary statistics (Section III-J) therefore represent a cross-year pooled summary that may blend multiple regimes for the same CPA.
-We treat this as a design choice: the accountant-level aggregates characterize each CPA's overall distribution over the full sample period, not a single time-invariant regime.
-
-The intra-report consistency analysis in Section IV-H.3 is a related but distinct check: it tests whether the *two co-signing CPAs on the same report* receive the same signature-level label (firm-level signing-practice homogeneity) rather than testing A2 at the same-CPA level.
-A direct empirical check of A2 would require labeling multiple reports of the same CPA in the same year and is left to future work; as noted above, no reported statistic relies on A2, and A2's interpretive scope is further bounded by the worst-case aggregation rule of Section III-L.
+The intra-report consistency analysis in Section IV-H.3 is a firm-level homogeneity check---whether the *two co-signing CPAs on the same report* receive the same signature-level label under the operational classifier---rather than a test of within-partner or within-year uniformity.

 For accountant-level analysis we additionally aggregate these per-signature statistics to the CPA level by computing the mean best-match cosine and the mean *independent minimum dHash* across all signatures of that CPA.
 The *independent minimum dHash* of a signature is defined as the minimum Hamming distance to *any* other signature of the same CPA (over the full same-CPA set).
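The *independent minimum dHash* defined in the hunk above is a minimum Hamming distance over a CPA's perceptual-hash set. A minimal sketch (illustrative names, not the paper's implementation; it assumes dHash values are stored as 64-bit integers):

```python
# Illustrative sketch, not the paper's code: the independent minimum dHash
# of signature i is the smallest Hamming distance from its 64-bit dHash to
# the dHash of ANY other signature of the same CPA.
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")  # number of differing bits

def independent_min_dhash(hashes: list[int], i: int) -> int:
    """Min Hamming distance from signature i's hash to every other hash
    in the same-CPA set."""
    return min(hamming(hashes[i], h) for j, h in enumerate(hashes) if j != i)
```

A near-zero value flags a near-identical pair, matching the min-dHash side of the max/min identification logic described above.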
@@ -155,9 +149,9 @@ We use this only as background context for why Firm A is a plausible calibration

 We establish Firm A's replication-dominated status through three primary independent quantitative analyses plus a fourth strand comprising three complementary checks, each of which can be reproduced from the public audit-report corpus alone:

-First, *independent visual inspection* of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners---a physical impossibility under independent hand-signing events.
+First, *independent visual inspection* of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for many of the sampled partners---a physical impossibility under independent hand-signing events.

-Second, *whole-sample signature-level rates*: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% form a long left tail consistent with a minority of hand-signers.
+Second, *whole-sample signature-level rates*: 92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95, consistent with non-hand-signing as the dominant mechanism, while the remaining 7.5% form a long left tail reflecting within-firm heterogeneity in signing output (we do not disaggregate partner-level mechanism here; see Section III-G for the scope of claims).

 Third, *accountant-level mixture analysis* (Section IV-E): a BIC-selected three-component Gaussian mixture over per-accountant mean cosine and mean dHash places 139 of the 171 Firm A CPAs (with $\geq 10$ signatures) in the high-replication C1 cluster and 32 in the middle-band C2 cluster, directly quantifying the within-firm heterogeneity.

@@ -170,7 +164,7 @@ We emphasize that the 92.5% figure is a within-sample consistency check rather t

 We emphasize that Firm A's replication-dominated status was *not* derived from the thresholds we calibrate against it.
 Its identification rests on visual evidence and accountant-level clustering that is independent of the statistical pipeline.
-The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section III-I below)---and for avoiding overclaim in downstream inference.
+The "replication-dominated, not pure" framing is important both for internal consistency---it predicts and explains the long left tail observed in Firm A's cosine distribution (Section IV-D)---and for avoiding overclaim in downstream inference.

 ## I. Convergent Threshold Determination with a Density-Smoothness Diagnostic

@@ -250,7 +244,7 @@ We further emphasize that this anchor is a *subset* of the true positive class--
 Inter-CPA pairs cannot arise from reuse of a single signer's stored signature image, so this population is a reliable negative class for threshold sweeps.
 This anchor is substantially larger than a simple low-similarity-same-CPA negative and yields tight Wilson 95% confidence intervals on FAR at each candidate threshold.

-3. **Firm A anchor (replication-dominated prior positive):** Firm A signatures, treated as a majority-positive reference with within-firm heterogeneity in the left tail (consistent with a minority of hand-signers), as evidenced by the 32/171 middle-band share in the accountant-level mixture (Section III-H).
+3. **Firm A anchor (replication-dominated prior positive):** Firm A signatures, treated as a majority-positive reference with within-firm heterogeneity in the left tail, as evidenced by the 32/171 middle-band share in the accountant-level mixture (Section III-H).
 Because Firm A is both used for empirical percentile calibration in Section III-H and as a validation anchor, we make the within-Firm-A sampling variance visible by splitting Firm A CPAs randomly (at the CPA level, not the signature level) into a 70% *calibration* fold and a 30% *heldout* fold.
 The calibration-fold percentiles used in thresholding---cosine median, P1, and P5 (lower-tail, since higher cosine indicates greater similarity), and dHash_indep median and P95 (upper-tail, since lower dHash indicates greater similarity)---are derived from the 70% calibration fold only.
 The heldout fold is used exclusively to report post-hoc capture rates with Wilson 95% confidence intervals.
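The CPA-level 70/30 split and the Wilson 95% intervals referenced in the hunk above can be sketched as follows. This is an illustrative implementation under stated assumptions (function names and the seeding scheme are ours, not the paper's); the interval is the standard Wilson score interval for a binomial proportion.

```python
# Illustrative sketch: CPA-level calibration/heldout split and Wilson
# score intervals for reported capture rates. Not the paper's code.
import math
import random

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """Wilson 95% score interval for a binomial proportion (z=1.96)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half, center + half)

def split_cpas(cpa_ids, calib_frac=0.7, seed=0):
    """Random split at the CPA level (not the signature level) into a
    calibration fold and a heldout fold."""
    ids = sorted(set(cpa_ids))
    random.Random(seed).shuffle(ids)
    k = round(calib_frac * len(ids))
    return set(ids[:k]), set(ids[k:])
```

Splitting by CPA rather than by signature keeps all signatures of one partner in the same fold, so heldout capture rates are not inflated by same-CPA leakage across folds.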