diff --git a/paper/Paper_A_IEEE_Access_Draft_v3.docx b/paper/Paper_A_IEEE_Access_Draft_v3.docx
index fa800b6..91f52a0 100644
Binary files a/paper/Paper_A_IEEE_Access_Draft_v3.docx and b/paper/Paper_A_IEEE_Access_Draft_v3.docx differ
diff --git a/paper/paper_a_introduction_v3.md b/paper/paper_a_introduction_v3.md
index 686158a..19ec10a 100644
--- a/paper/paper_a_introduction_v3.md
+++ b/paper/paper_a_introduction_v3.md
@@ -70,11 +70,11 @@ The contributions of this paper are summarized as follows:
 
 3. **Dual-descriptor verification.** We demonstrate that combining deep-feature cosine similarity with perceptual hashing resolves the fundamental ambiguity between style consistency and image reproduction, and we validate the backbone choice through an ablation study comparing three feature-extraction architectures.
 
-4. **Percentile-anchored operational threshold.** We anchor the operational classifier's cosine cut on the whole-sample Firm A P7.5 percentile (cos $> 0.95$), a transparent and reproducible reference drawn from a known-majority-positive population, and complement it with dHash structural cuts derived from the same reference distribution. Operational thresholds are therefore explained by an empirical reference rather than asserted.
+4. **Percentile-anchored operational threshold.** We anchor the operational classifier's cosine cut on the whole-sample Firm A P7.5 percentile (cos $> 0.95$), a transparent and reproducible reference drawn from a replication-dominated reference population, and complement it with dHash structural cuts derived from the same reference distribution. Operational thresholds are therefore explained by an empirical reference rather than asserted.
 
 5. **Distributional characterisation of per-signature similarity.** We apply three statistical diagnostics---a Hartigan dip test, an EM-fitted Beta mixture with logit-Gaussian robustness check, and a Burgstahler-Dichev / McCrary density-smoothness procedure---to characterise the shape of the per-signature similarity distribution. The three diagnostics jointly find that per-signature similarity forms a continuous quality spectrum, which both motivates the percentile-based operational anchor over a mixture-fit crossing and is itself a substantive finding for the document-forensics literature on similarity-threshold selection.
 
-6. **Replication-dominated calibration methodology.** We introduce a calibration strategy using a known-majority-positive reference group, distinguishing *replication-dominated* from *replication-pure* anchors; and we validate classification using byte-level pixel identity as an annotation-free gold positive, requiring no manual labeling.
+6. **Replication-dominated calibration methodology.** We introduce a calibration strategy using a replication-dominated reference group, distinguishing *replication-dominated* from *replication-pure* anchors; and we validate classification using byte-level pixel identity as an annotation-free gold positive, requiring no manual labeling.
 
 7. **Large-scale empirical analysis.** We report findings from the analysis of over 90,000 audit reports spanning a decade, providing the first large-scale empirical evidence on non-hand-signing practices in financial reporting under a methodology designed for peer-review defensibility.
 
diff --git a/paper/paper_a_methodology_v3.md b/paper/paper_a_methodology_v3.md
index 6ab28a1..1898962 100644
--- a/paper/paper_a_methodology_v3.md
+++ b/paper/paper_a_methodology_v3.md
@@ -116,7 +116,7 @@ Cosine similarity and dHash are both robust to the noise introduced by the print
 
 Two unit-of-analysis choices are relevant for this study, ordered from finest to coarsest: (i) the *signature*---one signature image extracted from one report; and (ii) the *auditor-year*---all signatures by one CPA within one fiscal year.
 The signature is the operational unit of classification (Section III-K) and of all primary statistical analyses (Section IV-D, IV-F, IV-G).
-The auditor-year is used in the partner-level similarity ranking of Section IV-G.2 as a deliberately within-year aggregation that avoids cross-year pooling.
+The auditor-year is used in the partner-level similarity ranking of Section IV-G.2 as a within-year aggregation unit: each auditor-year's mean is computed over its own fiscal-year signatures, although the per-signature best-match cosine that feeds the mean is computed against the full same-CPA cross-year pool (Section III-G's max-cosine / min-dHash definition).
 We do not use a coarser CPA-level cross-year unit, because pooling a CPA's signatures across the full 2013--2023 sample period would conflate distinct signing-mechanism regimes whenever a CPA's practice changes during the sample, and we make no claim about the within-CPA stability of signing mechanisms over time.
 
 For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA (over the full same-CPA set, not restricted to the same fiscal year).
@@ -177,7 +177,7 @@ The two roles are kept separate by design.
 
 The reason for the split is empirical.
 The three statistical diagnostics jointly find that per-signature similarity forms a continuous quality spectrum (Section IV-D, summarised below): the dip test fails to reject unimodality for Firm A; BIC strongly prefers a 3-component over a 2-component Beta fit, so the 2-component crossing is a forced fit; and the BD/McCrary candidate transition lies inside the non-hand-signed mode rather than between modes (and is not bin-width-stable; Appendix A).
-Under these conditions the natural anchor for an operational cosine cut is a transparent percentile of a known-majority-positive reference population (Firm A) rather than a mixture-fit crossing whose location depends on parametric assumptions the data do not support.
+Under these conditions the natural anchor for an operational cosine cut is a transparent percentile of a replication-dominated reference population (Firm A) rather than a mixture-fit crossing whose location depends on parametric assumptions the data do not support.
 We describe the three diagnostics and the assumptions underlying each in the subsections below.
 
 The two threshold estimators rest on decreasing-in-strength assumptions: the KDE antimode/crossover requires only smoothness; the Beta mixture additionally requires a parametric specification, and the logit-Gaussian cross-check reports sensitivity to that form.
diff --git a/paper/paper_a_results_v3.md b/paper/paper_a_results_v3.md
index 9b06866..17096ad 100644
--- a/paper/paper_a_results_v3.md
+++ b/paper/paper_a_results_v3.md
@@ -149,6 +149,7 @@ We report three validation analyses corresponding to the anchors of Section III-
 ### 1) Pixel-Identity Positive Anchor with Inter-CPA Negative Anchor
 
 Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the byte-identity positive anchor---a pair-level proof of image reuse that serves as conservative ground truth for non-hand-signed signatures, subject to the source-template edge case discussed in Section V-G.
+Within Firm A specifically, 145 of these byte-identical signatures are distributed across 50 distinct partners (of 180 registered Firm A partners), with 35 of the byte-identical pairs spanning different fiscal years; this Firm A decomposition is reproduced by `signature_analysis/28_byte_identity_decomposition.py` and reported in `reports/byte_identity_decomp/byte_identity_decomposition.json` (Appendix B).
 As the gold-negative anchor we sample 50,000 random cross-CPA signature pairs (inter-CPA cosine: mean $= 0.762$, $P_{95} = 0.884$, $P_{99} = 0.913$, max $= 0.988$).
 Because the positive and negative anchor populations are constructed from different sampling units (byte-identical same-CPA pairs vs random inter-CPA pairs), their relative prevalence in the combined anchor set is arbitrary, and precision / $F_1$ / recall therefore have no meaningful population interpretation.
 We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor in Table X.
@@ -370,9 +371,8 @@ We note that because the non-hand-signed thresholds are themselves calibrated to
 
 ### 2) Cross-Firm Comparison of Dual-Descriptor Convergence
 
-Among the 65,515 non-Firm-A signatures with per-signature best-match cosine $> 0.95$, 42.12% have $\text{dHash}_\text{indep} \leq 5$, compared to 88.32% of the 55,921 Firm A signatures meeting the same cosine condition---a $\sim 2.1\times$ difference that the structural-verification layer makes visible.
-The Firm A denominator here (55,921) differs by a single signature from Table IX's cosine-only count (55,922) because the two artifacts were materialized from successive snapshots of the underlying database: Table IX is rendered from `validation_recalibration.json` produced earlier in the analysis pipeline, while the cross-firm decomposition is rendered from `byte_identity_decomposition.json` produced more recently after a downstream feature recomputation that shifted exactly one borderline Firm A signature from `cos > 0.95` to `cos = 0.95...` at floating-point precision.
-The one-record drift does not affect any reported rate to two decimal places; we retain both values to make the snapshot provenance explicit.
+Among the 65,514 non-Firm-A signatures with per-signature best-match cosine $> 0.95$, 42.12% have $\text{dHash}_\text{indep} \leq 5$, compared to 88.32% of the 55,922 Firm A signatures meeting the same cosine condition---a $\sim 2.1\times$ difference that the structural-verification layer makes visible.
+The Firm A denominator (55,922) matches Table IX exactly: both Table IX and the cross-firm decomposition define Firm A membership via the CPA registry (`accountants.firm`), and the cross-firm analysis additionally requires a non-null independent-min dHash record, which all 55,922 Firm A cosine-eligible signatures have in the current database.
 This cross-firm gap is consistent with firm-wide non-hand-signing practice at Firm A versus partner-specific or per-engagement replication at other firms; it complements the partner-level ranking (Section IV-G.2) and intra-report consistency (Section IV-G.3) findings.
 Counts and percentages are reproduced by `signature_analysis/28_byte_identity_decomposition.py` and reported in `reports/byte_identity_decomp/byte_identity_decomposition.json` (see Appendix B for the table-to-script provenance map).
 
diff --git a/signature_analysis/28_byte_identity_decomposition.py b/signature_analysis/28_byte_identity_decomposition.py
index 1695c3b..25b8bc6 100644
--- a/signature_analysis/28_byte_identity_decomposition.py
+++ b/signature_analysis/28_byte_identity_decomposition.py
@@ -16,6 +16,12 @@ lacked dedicated provenance (codex review v3.18.1 items #7 and #8):
      the fraction with min_dhash_independent <= 5, broken out by Firm A vs
      Non-Firm-A.
 
+Firm A membership is defined throughout via accountants.firm (the CPA
+registry firm) joined on signatures.assigned_accountant. This matches
+the convention used by signature_analysis/24_validation_recalibration.py
+and the validation_recalibration JSON, so counts are directly comparable
+to Tables IX / XI / XII.
+
 Output:
   /Volumes/NV2/PDF-Processing/signature-analysis/reports/byte_identity_decomp/
     byte_identity_decomposition.json
@@ -57,9 +63,10 @@ def byte_identity_decomposition(conn):
                 s1.year_month AS ym_a,
                 s2.year_month AS ym_b
             FROM signatures s1
+            JOIN accountants a ON s1.assigned_accountant = a.name
             JOIN signatures s2 ON s1.closest_match_file = s2.image_filename
             WHERE s1.pixel_identical_to_closest = 1
-              AND s1.excel_firm = ?
+              AND a.firm = ?
         )
         SELECT
             COUNT(*) AS total_pixel_identical_firm_a,
@@ -94,15 +101,15 @@ def cross_firm_dual_convergence(conn):
 
     cur.execute("""
         SELECT
-            CASE WHEN excel_firm = ? THEN 'Firm A' ELSE 'Non-Firm-A' END
+            CASE WHEN a.firm = ? THEN 'Firm A' ELSE 'Non-Firm-A' END
                 AS firm_group,
             COUNT(*) AS n_signatures_above_095,
-            SUM(CASE WHEN min_dhash_independent <= 5 THEN 1 ELSE 0 END)
+            SUM(CASE WHEN s.min_dhash_independent <= 5 THEN 1 ELSE 0 END)
                 AS n_dhash_le_5
-        FROM signatures
-        WHERE max_similarity_to_same_accountant > 0.95
-          AND assigned_accountant IS NOT NULL
-          AND min_dhash_independent IS NOT NULL
+        FROM signatures s
+        JOIN accountants a ON s.assigned_accountant = a.name
+        WHERE s.max_similarity_to_same_accountant > 0.95
+          AND s.min_dhash_independent IS NOT NULL
         GROUP BY firm_group
         ORDER BY firm_group
     """, (FIRM_A,))
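
Review note on the `cross_firm_dual_convergence` hunk: a minimal sketch, assuming a toy in-memory SQLite schema, of why the revised query can drop the explicit `assigned_accountant IS NOT NULL` filter. The inner join on `accountants.name` already excludes signatures whose `assigned_accountant` is NULL or missing from the registry. The tables carry only the columns the query touches; `build_toy_db`, the CPA names, the placeholder firm label, and every row value are hypothetical and are not taken from the project database.

```python
import sqlite3

# Hypothetical placeholder; the real FIRM_A constant is defined in the script.
FIRM_A = "Firm A (placeholder)"


def cross_firm_dual_convergence(conn, firm_a):
    """Run the post-change query and return {firm_group: (n_above_095, n_dhash_le_5)}."""
    cur = conn.cursor()
    cur.execute("""
        SELECT
            CASE WHEN a.firm = ? THEN 'Firm A' ELSE 'Non-Firm-A' END
                AS firm_group,
            COUNT(*) AS n_signatures_above_095,
            SUM(CASE WHEN s.min_dhash_independent <= 5 THEN 1 ELSE 0 END)
                AS n_dhash_le_5
        FROM signatures s
        JOIN accountants a ON s.assigned_accountant = a.name
        WHERE s.max_similarity_to_same_accountant > 0.95
          AND s.min_dhash_independent IS NOT NULL
        GROUP BY firm_group
        ORDER BY firm_group
    """, (firm_a,))
    return {row[0]: (row[1], row[2]) for row in cur.fetchall()}


def build_toy_db():
    """Build a tiny in-memory database (assumed schema, illustrative rows only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE accountants (name TEXT PRIMARY KEY, firm TEXT);
        CREATE TABLE signatures (
            image_filename TEXT,
            assigned_accountant TEXT,
            max_similarity_to_same_accountant REAL,
            min_dhash_independent INTEGER
        );
    """)
    conn.executemany(
        "INSERT INTO accountants VALUES (?, ?)",
        [("cpa_1", FIRM_A), ("cpa_2", "Other Firm")],
    )
    conn.executemany(
        "INSERT INTO signatures VALUES (?, ?, ?, ?)",
        [
            ("a.png", "cpa_1", 0.99, 2),     # Firm A, counted, dHash <= 5
            ("b.png", "cpa_1", 0.97, 9),     # Firm A, counted, dHash > 5
            ("c.png", "cpa_2", 0.96, 4),     # Non-Firm-A, counted, dHash <= 5
            ("d.png", "cpa_2", 0.90, 1),     # excluded: cosine not above 0.95
            ("e.png", None, 0.99, 1),        # excluded by the join: no accountant
            ("f.png", "cpa_1", 0.98, None),  # excluded: NULL independent-min dHash
        ],
    )
    return conn


result = cross_firm_dual_convergence(build_toy_db(), FIRM_A)
print(result)
```

In this toy data the row with a NULL `assigned_accountant` never reaches the `WHERE` clause, so both groups are counted only over registry-matched signatures, which is the membership convention the docstring hunk describes.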