From ef0e417257600cb7617c9520379e535318b41af4 Mon Sep 17 00:00:00 2001
From: gbanyan
Date: Fri, 24 Apr 2026 21:21:37 +0800
Subject: [PATCH] Paper A v3.13: resolve Opus 4.7 round-12 + codex gpt-5.5
 round-13 findings
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Opus 4.7 max-effort round-12 on v3.12 found 1 MAJOR + 7 MINOR residues; codex gpt-5.5 xhigh round-13 cross-verified 11/11 RESOLVED and caught one additional cosine-P95 ambiguity Opus missed (methodology L255). Total 12 text-only edits across 5 files.

MAJOR M1 - Cosine P95→P7.5 terminology residue at two sites that cite the v3.12-corrected Section III-L but still wrote "P95" (self-contradiction). Fix: methodology L165 and results L247 both restated as "whole-sample Firm A P7.5 heuristic" with the 92.5%/7.5% complement spelled out.

MINOR findings and fixes:
- m1 Big-4 scope slip: methodology III-H(b) L166 and results IV-H.2 L311 said "every Big-4 auditor-year" but IV-H.2 ranking actually pools all 4,629 auditor-years across Big-4 and Non-Big-4. Both sites now say "every auditor-year ... across all firms."
- m2 178 vs 180 Firm A CPA breakdown: intro L54 and conclusion L21 now add "of 180 registered CPAs; 178 after excluding two with disambiguation ties, Section IV-G.2" parenthetical to avoid the misleading 180−171=9 reading.
- m3 IV-H.1 A2 citation: results L286 now explicitly invokes the A2 within-year label-uniformity convention (Section III-G) when reading the left-tail share as a partner-level "minority of hand-signers."
- m4 IV-F L177 cross-ref / fold distinction: corrected Section III-H → Section III-L anchor, and added explicit note that the 0.95 heuristic is a whole-sample anchor while Table XI thresholds are calibration-fold-derived (cosine P5 = 0.9407).
- m5 Table XVI (30,222) vs Table XVII (30,226) Firm A count gap: results L406 now explains the 4-report difference (XVI restricts to both-signers-Firm-A single-firm two-signer reports; XVII counts at-least-one-Firm-A signer under the 84,386-document cohort).
- m6 Methodology L156 "four independent quantitative analyses" actually enumerated 6 items: rephrased as "three primary independent quantitative analyses plus a fourth strand comprising three complementary checks."
- m7 Abstract "cluster into three groups" restored the "smoothly-mixed" qualifier to match Discussion V-B and Conclusion L17.
- Codex-caught residue at methodology L255 ("Median, 1st percentile, and 95th percentile of signature-level cosine/dHash distributions") grammatically applied P95 to cosine too. Rewrote as "cosine median, P1, and P5 (lower-tail) and dHash_indep median and P95 (upper-tail)" matching Table XI L233 exactly.

No re-computation. All tables (IV-XVIII) and Appendix A numbers unchanged. Abstract at 249/250 words after smoothly-mixed qualifier.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 paper/paper_a_abstract_v3.md     |  2 +-
 paper/paper_a_conclusion_v3.md   |  2 +-
 paper/paper_a_introduction_v3.md |  2 +-
 paper/paper_a_methodology_v3.md  |  8 ++++----
 paper/paper_a_results_v3.md      | 10 +++++-----
 5 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/paper/paper_a_abstract_v3.md b/paper/paper_a_abstract_v3.md
index a50263d..8aae155 100644
--- a/paper/paper_a_abstract_v3.md
+++ b/paper/paper_a_abstract_v3.md
@@ -2,6 +2,6 @@

-Regulations require Certified Public Accountants (CPAs) to attest to each audit report by affixing a signature. Digitization makes reusing a stored signature image across reports trivial---through administrative stamping or firm-level electronic signing---potentially undermining individualized attestation. Unlike forgery, *non-hand-signed* reproduction reuses the legitimate signer's own stored image, making it visually invisible to report users and infeasible to audit at scale manually. We present a pipeline integrating a Vision-Language Model for signature-page identification, YOLOv11 for signature detection, and ResNet-50 for feature extraction, followed by dual-descriptor verification combining cosine similarity and difference hashing. For threshold determination we apply two estimators---kernel-density antimode with a Hartigan unimodality test and an EM-fitted Beta mixture with a logit-Gaussian robustness check---plus a Burgstahler-Dichev/McCrary density-smoothness diagnostic, at the signature and accountant levels. Applied to 90,282 audit reports filed in Taiwan over 2013-2023 (182,328 signatures from 758 CPAs), the methods reveal a level asymmetry: signature-level similarity is a continuous quality spectrum that no two-component mixture separates, while accountant-level aggregates cluster into three groups with the antimode and two mixture estimators converging within $\sim$0.006 at cosine $\approx 0.975$. A major Big-4 firm is used as a *replication-dominated* (not pure) calibration anchor, with visual inspection and accountant-level mixture evidence supporting majority non-hand-signing alongside within-firm heterogeneity consistent with a minority of hand-signers; capture rates on both 70/30 calibration and held-out folds are reported with Wilson 95% intervals to make fold-level variance visible. Validation against 310 byte-identical positives and a $\sim$50,000-pair inter-CPA negative anchor yields FAR $\leq$ 0.001 at all accountant-level thresholds.
+Regulations require Certified Public Accountants (CPAs) to attest to each audit report by affixing a signature. Digitization makes reusing a stored signature image across reports trivial---through administrative stamping or firm-level electronic signing---potentially undermining individualized attestation. Unlike forgery, *non-hand-signed* reproduction reuses the legitimate signer's own stored image, making it visually invisible to report users and infeasible to audit at scale manually. We present a pipeline integrating a Vision-Language Model for signature-page identification, YOLOv11 for signature detection, and ResNet-50 for feature extraction, followed by dual-descriptor verification combining cosine similarity and difference hashing. For threshold determination we apply two estimators---kernel-density antimode with a Hartigan unimodality test and an EM-fitted Beta mixture with a logit-Gaussian robustness check---plus a Burgstahler-Dichev/McCrary density-smoothness diagnostic, at the signature and accountant levels. Applied to 90,282 audit reports filed in Taiwan over 2013-2023 (182,328 signatures from 758 CPAs), the methods reveal a level asymmetry: signature-level similarity is a continuous quality spectrum that no two-component mixture separates, while accountant-level aggregates cluster into three smoothly-mixed groups with the antimode and two mixture estimators converging within $\sim$0.006 at cosine $\approx 0.975$. A major Big-4 firm is used as a *replication-dominated* (not pure) calibration anchor, with visual inspection and accountant-level mixture evidence supporting majority non-hand-signing alongside within-firm heterogeneity consistent with a minority of hand-signers; capture rates on both 70/30 calibration and held-out folds are reported with Wilson 95% intervals to make fold-level variance visible. Validation against 310 byte-identical positives and a $\sim$50,000-pair inter-CPA negative anchor yields FAR $\leq$ 0.001 at all accountant-level thresholds.

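The Wilson 95% intervals cited in the abstract (and in the conclusion's fold-level capture rates) reduce to a short formula. A minimal illustrative sketch, assuming z = 1.96 for a 95% level; the function name and usage are assumptions for illustration, not taken from the paper's `signature_analysis` codebase:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# e.g. a 90% capture rate observed on a fold of 100 CPAs
lo, hi = wilson_interval(90, 100)
```

Unlike the naive Wald interval, the Wilson interval stays inside [0, 1] and remains informative for the near-0% and near-100% capture rates the fold comparison produces.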
diff --git a/paper/paper_a_conclusion_v3.md b/paper/paper_a_conclusion_v3.md
index 3371aa0..13334fb 100644
--- a/paper/paper_a_conclusion_v3.md
+++ b/paper/paper_a_conclusion_v3.md
@@ -18,7 +18,7 @@ The substantive reading is therefore narrower than "discrete behavior": *pixel-l

 Fourth, we introduced a *replication-dominated* calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor (310 byte-identical signatures) paired with a $\sim$50,000-pair inter-CPA negative anchor. To document the within-firm sampling variance of using the calibration firm as its own validation reference, we split the firm's CPAs 70/30 at the CPA level and report capture rates on both folds with Wilson 95% confidence intervals; extreme rules agree across folds while rules in the operational 85-95% capture band differ by 1-5 percentage points, reflecting within-firm heterogeneity in replication intensity rather than generalization failure.

-This framing is internally consistent with all available evidence: the visual-inspection observation of pixel-identical signatures across unrelated audit engagements for the majority of calibration-firm partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and, among the 171 calibration-firm CPAs with enough signatures to enter the accountant-level GMM (of 180 in total), the 139 / 32 split between the high-replication and middle-band clusters.
+This framing is internally consistent with all available evidence: the visual-inspection observation of pixel-identical signatures across unrelated audit engagements for the majority of calibration-firm partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and, among the 171 calibration-firm CPAs with enough signatures to enter the accountant-level GMM (of 180 registered CPAs; 178 after excluding two with disambiguation ties, Section IV-G.2), the 139 / 32 split between the high-replication and middle-band clusters.

 An ablation study comparing ResNet-50, VGG-16 and EfficientNet-B0 confirmed that ResNet-50 offers the best balance of discriminative power, classification stability, and computational efficiency for this task.
diff --git a/paper/paper_a_introduction_v3.md b/paper/paper_a_introduction_v3.md
index cd043cf..63abff2 100644
--- a/paper/paper_a_introduction_v3.md
+++ b/paper/paper_a_introduction_v3.md
@@ -51,7 +51,7 @@ By requiring convergent evidence from both descriptors, we can differentiate *st

 A second distinctive feature is our framing of the calibration reference. One major Big-4 accounting firm in Taiwan (hereafter "Firm A") is widely recognized within the audit profession as making substantial use of non-hand-signing for the majority of its certifying partners, while not ruling out that a minority may continue to hand-sign some reports. We therefore treat Firm A as a *replication-dominated* calibration reference rather than a pure positive class.

-This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of the 171 Firm A CPAs with enough signatures to enter our accountant-level analysis (of 180 Firm A CPAs in total) cluster into an accountant-level "middle band" rather than the high-replication mode.
+This framing is important because the statistical signature of a replication-dominated population is visible in our data: Firm A's per-signature cosine distribution is unimodal with a long left tail, 92.5% of Firm A signatures exceed cosine 0.95 but 7.5% fall below, and 32 of the 171 Firm A CPAs with enough signatures to enter our accountant-level analysis (of 180 Firm A CPAs in the registry; 178 after excluding two with disambiguation ties, see Section IV-G.2) cluster into an accountant-level "middle band" rather than the high-replication mode.

 Adopting the replication-dominated framing---rather than a near-universal framing that would have to absorb these residuals as noise---ensures internal coherence among the visual-inspection evidence, the signature-level statistics, and the accountant-level mixture.

 A third distinctive feature is our unit-of-analysis treatment.
diff --git a/paper/paper_a_methodology_v3.md b/paper/paper_a_methodology_v3.md
index d16fae6..3e1c9e9 100644
--- a/paper/paper_a_methodology_v3.md
+++ b/paper/paper_a_methodology_v3.md
@@ -153,7 +153,7 @@ Rather than treating Firm A as a synthetic or laboratory positive control, we tr

 The background context for this choice is practitioner knowledge about Firm A's signing practice: industry practice at the firm is widely understood among practitioners to involve reproducing a stored signature image for the majority of certifying partners---originally via administrative stamping workflows and later via firm-level electronic signing systems---while not ruling out that a minority of partners may continue to hand-sign some or all of their reports. We use this only as background context for why Firm A is a plausible calibration candidate; the *evidence* for Firm A's replication-dominated status comes entirely from the paper's own analyses, which do not depend on any claim about signing practice beyond what the audit-report images themselves show.

-We establish Firm A's replication-dominated status through four independent quantitative analyses, each of which can be reproduced from the public audit-report corpus alone:
+We establish Firm A's replication-dominated status through three primary independent quantitative analyses plus a fourth strand comprising three complementary checks, each of which can be reproduced from the public audit-report corpus alone:

 First, *independent visual inspection* of randomly sampled Firm A reports reveals pixel-identical signature images across different audit engagements and fiscal years for the majority of partners---a physical impossibility under independent hand-signing events.
@@ -162,8 +162,8 @@ Second, *whole-sample signature-level rates*: 92.5% of Firm A's per-signature be

 Third, *accountant-level mixture analysis* (Section IV-E): a BIC-selected three-component Gaussian mixture over per-accountant mean cosine and mean dHash places 139 of the 171 Firm A CPAs (with $\geq 10$ signatures) in the high-replication C1 cluster and 32 in the middle-band C2 cluster, directly quantifying the within-firm heterogeneity.

 Fourth, we additionally validate the Firm A benchmark through three complementary analyses reported in Section IV-H. Only the partner-level ranking is fully threshold-free; the longitudinal-stability and intra-report analyses use the operational classifier and are interpreted as consistency checks on its firm-level output:
- (a) *Longitudinal stability (Section IV-H.1).* The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6-13% across 2013-2023, with the lowest share in 2023. The 0.95 cutoff is the whole-sample Firm A P95 of the per-signature cosine distribution (Section III-L); the substantive finding here is the *temporal stability* of the rate, not the absolute rate at any single year.
- (b) *Partner-level similarity ranking (Section IV-H.2).* When every Big-4 auditor-year is ranked globally by its per-auditor-year mean best-match cosine, Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff.
+ (a) *Longitudinal stability (Section IV-H.1).* The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6-13% across 2013-2023, with the lowest share in 2023. The 0.95 cutoff is the whole-sample Firm A P7.5 heuristic (Section III-L; 92.5% of whole-sample Firm A signatures exceed this cutoff); the substantive finding here is the *temporal stability* of the rate, not the absolute rate at any single year.
+ (b) *Partner-level similarity ranking (Section IV-H.2).* When every auditor-year is ranked globally by its per-auditor-year mean best-match cosine (across all firms: Big-4 and Non-Big-4), Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff.
 (c) *Intra-report consistency (Section IV-H.3).* Because each Taiwanese statutory audit report is co-signed by two engagement partners, firm-wide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label under the classifier. Firm A exhibits 89.9% intra-report agreement against 62-67% at the other Big-4 firms. This test uses the operational classifier and is therefore a *consistency* check on the classifier's firm-level output rather than a threshold-free test; the cross-firm gap (not the absolute rate) is the substantive finding.

 We emphasize that the 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the visual inspection, the accountant-level mixture, the three complementary analyses above, and the held-out Firm A fold (which confirms the qualitative replication-dominated framing; fold-level rate differences are disclosed in Section IV-G.2) described in Section III-K.
@@ -252,7 +252,7 @@ This anchor is substantially larger than a simple low-similarity-same-CPA negati

 3. **Firm A anchor (replication-dominated prior positive):** Firm A signatures, treated as a majority-positive reference with within-firm heterogeneity in the left tail (consistent with a minority of hand-signers), as evidenced by the 32/171 middle-band share in the accountant-level mixture (Section III-H). Because Firm A is both used for empirical percentile calibration in Section III-H and as a validation anchor, we make the within-Firm-A sampling variance visible by splitting Firm A CPAs randomly (at the CPA level, not the signature level) into a 70% *calibration* fold and a 30% *heldout* fold.
-Median, 1st percentile, and 95th percentile of signature-level cosine/dHash distributions are derived from the calibration fold only.
+The calibration-fold percentiles used in thresholding---cosine median, P1, and P5 (lower-tail, since higher cosine indicates greater similarity), and dHash_indep median and P95 (upper-tail, since lower dHash indicates greater similarity)---are derived from the 70% calibration fold only.
diff --git a/paper/paper_a_results_v3.md b/paper/paper_a_results_v3.md index 5e52e61..8406858 100644 --- a/paper/paper_a_results_v3.md +++ b/paper/paper_a_results_v3.md @@ -174,7 +174,7 @@ Table IX reports the proportion of Firm A signatures crossing each candidate thr All rates computed exactly from the full Firm A sample (N = 60,448 signatures); counts reproduce from `signature_analysis/24_validation_recalibration.py` (whole_firm_a section). --> -Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to Firm A via the calibration described in Section III-H. +Table IX is a whole-sample consistency check rather than an external validation: the thresholds 0.95, dHash median, and dHash 95th percentile are themselves anchored to the whole-sample Firm A distribution described in Section III-L (the 70/30 calibration-fold thresholds of Table XI are separate and slightly different, e.g., calibration-fold cosine P5 = 0.9407 rather than the whole-sample heuristic 0.95). The dual rule cosine $> 0.95$ AND dHash $\leq 8$ captures 89.95% of Firm A, a value that is consistent with both the accountant-level crossings (Section IV-E) and the 139/32 high-replication versus middle-band split within Firm A (Section IV-E). Section IV-G reports the corresponding rates on the 30% Firm A hold-out fold, which provides the external check these whole-sample rates cannot. @@ -244,7 +244,7 @@ We therefore interpret the held-out fold as confirming the qualitative finding ( ### 3) Operational-Threshold Sensitivity: cos $> 0.95$ vs cos $> 0.945$ -The per-signature classifier (Section III-L) uses cos $> 0.95$ as its operational cosine cut, anchored on the whole-sample Firm A P95 heuristic. 
+The per-signature classifier (Section III-L) uses cos $> 0.95$ as its operational cosine cut, anchored on the whole-sample Firm A P7.5 heuristic (i.e., 7.5% of whole-sample Firm A signatures lie at or below 0.95; see Section III-L). The accountant-level convergent threshold analysis (Section IV-E) places the primary accountant-level reference between $0.973$ and $0.979$ (KDE antimode, Beta-2 crossing, logit-Gaussian robustness crossing), and the accountant-level 2D-GMM marginal at $0.945$. Because the classifier operates at the signature level while these convergent accountant-level estimates are at the accountant level, they are formally non-substitutable. We report a sensitivity check in which the classifier's operational cut cos $> 0.95$ is replaced by the nearest accountant-level reference, cos $> 0.945$. @@ -283,7 +283,7 @@ Subsection H.3 applies the calibrated classifier and is therefore a consistency ### 1) Year-by-Year Stability of the Firm A Left Tail Table XIII reports the proportion of Firm A signatures with per-signature best-match cosine below 0.95, disaggregated by fiscal year. -Under the replication-dominated interpretation (Section III-H) this left-tail share captures the minority of Firm A partners who continue to hand-sign. +Under the replication-dominated interpretation (Section III-H) and the within-year label-uniformity convention A2 (Section III-G), this left-tail share is read as a partner-level minority of Firm A CPAs who continue to hand-sign rather than as a bare signature-level rate. Under the alternative hypothesis that the left tail is an artifact of scan or compression noise, the share should shrink as scanning and PDF-compression technology improved over 2013-2023.
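The whole-sample P7.5 heuristic that these hunks restate is a one-line percentile computation: the cutoff below which 7.5% of per-signature cosines fall, whose complement is the quoted 92.5%. A minimal sketch on synthetic data (the `percentile` helper and the Beta-distributed stand-in values are illustrative assumptions, not taken from the paper's codebase):

```python
import math
import random

def percentile(values, q):
    """Percentile with linear interpolation between order statistics."""
    xs = sorted(values)
    k = (len(xs) - 1) * q / 100.0
    f, c = math.floor(k), math.ceil(k)
    if f == c:
        return xs[int(k)]
    return xs[f] + (xs[c] - xs[f]) * (k - f)

random.seed(0)
# Synthetic stand-in for per-signature best-match cosines:
# a left-skewed distribution concentrated near 1.0.
cosines = [random.betavariate(40, 2) for _ in range(10_000)]

# P7.5 cutoff: 7.5% of values lie at or below it, so 92.5% exceed it
# (the complement figure quoted alongside the heuristic).
cut = percentile(cosines, 7.5)
share_at_or_below = sum(v <= cut for v in cosines) / len(cosines)
```

The same helper with q = 5 gives a calibration-fold-style lower-tail threshold for cosine, and with q = 95 an upper-tail threshold for a dHash-like distance where *lower* values mean greater similarity.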