pdf_signature_extraction

Author	SHA1	Message	Date
gbanyan	1dfbc5f000	Paper A v3.15: resolve Gemini 3.1 Pro round-15 Accept-verdict minor polish Gemini 3.1 Pro round-15 full-paper review of v3.14 returned Accept with four MINOR polish suggestions. All four applied in this commit. 1. Table XIII column header: "mean cosine" renamed to "mean best-match cosine" to match the underlying metric (per-signature best-match over the full same-CPA pool) and prevent readers from inferring a simpler per-year statistic. 2. Methodology III-L (L284): added a forward-pointer in the first threshold-convention note to Section IV-G.3, explicitly confirming that replacing the 0.95 round-number heuristic with the nearby accountant-level 2D-GMM marginal crossing 0.945 alters aggregate firm-level capture rates by at most ~1.2 percentage points. This pre-empts a reader who might worry about the methodological tension between the heuristic and the mixture-derived convergence band. 3. Results IV-I document-level aggregation (L383): "Document-level rates therefore bound the share..." rewritten as "represent the share..." Gemini correctly noted that worst-case aggregation directly assigns (subject to classifier error), so "bound" spuriously implies an inequality not actually present. 4. Results IV-G.4 Sanity Sample (L273): "inter-rater agreement with the classifier" rewritten as "full human--classifier agreement (30/30)". Inter-rater conventionally refers to human-vs-human agreement; human-vs-classifier is the correct term here. No substantive changes; no tables recomputed. Gemini round-15 verdict was Accept with these four items framed as nice-to-have rather than blockers; applying them brings v3.15 to a fully polished state before manual DOCX packaging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 01:01:58 +08:00
gbanyan	d3b63fc0b7	Paper A v3.14: remove A2 assumption + soften all partner-level claims The within-auditor-year uniformity assumption (A2) introduced in v3.11 Section III-G was empirically tested via a new within-year uniformity check (signature_analysis/27_within_year_uniformity.py; output in reports/within_year_uniformity/). The check found that within-year pairwise cosine distributions even at the calibration firm show substantial heterogeneity inconsistent with strict single-mechanism uniformity (Firm A 2023 CPAs typically have median pairwise cosine around 0.85 with 20-70% of pairs below the all-pairs KDE crossover 0.837). A2 as stated ("a CPA who replicates any signature image in that year is treated as doing so for every report") is therefore falsified empirically. Three explanations are compatible with the data and cannot be disambiguated without manual inspection: (i) true within-year mechanism mixing, (ii) multi-template replication workflows at the same firm within a year, (iii) feature-extraction noise on repeatedly scanned stamped images. Since A2 is falsified and its implications cannot be restored under any of the three explanations, we remove A2 entirely rather than downgrading it to an "approximation" or "interpretive convention." Changes applied: 1. Methodology Section III-G: A2 block deleted. Section now has only A1 (pair-detectability, cross-year pair-existence). Replaced A2 with an explicit statement that we make no within-year or across-year uniformity assumption, that per-signature labels are signature-level quantities throughout, and that we abstain from partner-level frequency inferences. Three candidate explanations for within-year signature heterogeneity are listed (single-template replication, multi-template replication in parallel, within-year mixing, or combinations) without attempting disaggregation. 2. 
Methodology III-H strand 2 (L154) softened: "7.5% form a long left tail consistent with a minority of hand-signers" rewritten as reflecting "within-firm heterogeneity in signing output (we do not disaggregate partner-level mechanism here; see Section III-G)." 3. Methodology III-H visual-inspection strand (L152) and the corresponding Discussion V-C first strand (L41) and Conclusion L21 softened: "for the majority of partners" changed to "for many of the sampled partners" (Codex round-14 MAJOR: "majority of partners" is itself a partner-level frequency claim under the new scope-of-claims regime). 4. Methodology III-K.3 Firm A anchor (L247): dropped "(consistent with a minority of hand-signers)" parenthetical. 5. Results IV-D cosine distribution narrative (L72): softened to "within-firm heterogeneity in signing outputs (see Section IV-E and Section III-G for the scope of partner-level claims)." 6. Results IV-E cluster split framing (L128): "minority-hand-signers framing of Section III-H" renamed to "within-firm heterogeneity framing of Section III-H" (matches the new III-H text). 7. Results IV-H.1 partner-level reading (L286): removed entirely. The v3.13 text "Under the within-year label-uniformity convention A2, this left-tail share is read as a partner-level minority of hand-signing CPAs" is replaced by a signature-level statement that explicitly lists hand-signing partners, multi-template replication, or a combination as possibilities without attempting attribution. 8. Results IV-H.1 stability argument (L308): softened from "persistent minority of hand-signing Firm A partners" to "persistent within-firm heterogeneity component," preserving the substantive argument that stability across production technologies is inconsistent with a noise-only explanation. 9. 
Results IV-I Firm A Capture Profile (L407): rewrote the "Firm A's minority hand-signers have not been captured" phrasing as a signature-level framing about the 7.5% left tail not projecting into the lowest-cosine document-level category under the dual-descriptor rules. 10. Abstract (L5): softened "alongside within-firm heterogeneity consistent with a minority of hand-signers" to "alongside residual within-firm heterogeneity." Abstract at 244/250 words. 11. Discussion V-C third strand (L43): added "multi-template replication workflows" to the list of possibilities and added a local "we do not disaggregate these mechanisms; see Section III-G for the scope of claims" disclaimer (Codex round-14 MINOR 5). 12. Discussion Limitations: added an Eighth limitation explicitly stating that partner-level frequency inferences are not made and why (no within-year uniformity assumption is adopted). 13. Methodology L124 opening: "We make one stipulation about within-auditor-year structure" fixed to "same-CPA pair detectability," since A1 is a cross-year pair-existence property, not a within-year claim (Codex round-14 MINOR 3). 14. Two broken cross-references fixed (Codex round-14 MINOR 6): methodology L86 Section V-D -> V-G (Limitations is G, not D which is Style-Replication Gap); methodology L167 Section III-I -> Section IV-D (the empirical cosine distribution is in IV-D, not III-I). Script 27 and its output (reports/within_year_uniformity/*) remain in the repository as internal due-diligence evidence but are not cited from the paper. The paper's substantive claims at signature-level and accountant (cross-year pooled) level are unchanged; only the partner-level interpretive overlay is removed. All tables (IV-XVIII), Appendix A (BD/McCrary sensitivity), and all reported numbers are unchanged. 
Codex round-14 (gpt-5.5 xhigh) verification: Major Revision caused by one BLOCKER (stale DOCX artifact, not part of this commit) plus one MAJOR ("majority of partners" partner-frequency claim) plus four MINOR findings. All five markdown findings addressed in this commit. DOCX regeneration deferred to pre-submission packaging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 22:06:22 +08:00
gbanyan	ef0e417257	Paper A v3.13: resolve Opus 4.7 round-12 + codex gpt-5.5 round-13 findings Opus 4.7 max-effort round-12 on v3.12 found 1 MAJOR + 7 MINOR residues; codex gpt-5.5 xhigh round-13 cross-verified 11/11 RESOLVED and caught one additional cosine-P95 ambiguity Opus missed (methodology L255). Total 12 text-only edits across 5 files. MAJOR M1 - Cosine P95→P7.5 terminology residue at two sites that cite the v3.12-corrected Section III-L but still wrote "P95" (self-contradiction). Fix: methodology L165 and results L247 both restated as "whole-sample Firm A P7.5 heuristic" with the 92.5%/7.5% complement spelled out. MINOR findings and fixes: - m1 Big-4 scope slip: methodology III-H(b) L166 and results IV-H.2 L311 said "every Big-4 auditor-year" but IV-H.2 ranking actually pools all 4,629 auditor-years across Big-4 and Non-Big-4. Both sites now say "every auditor-year ... across all firms." - m2 178 vs 180 Firm A CPA breakdown: intro L54 and conclusion L21 now add "of 180 registered CPAs; 178 after excluding two with disambiguation ties, Section IV-G.2" parenthetical to avoid the misleading 180−171=9 reading. - m3 IV-H.1 A2 citation: results L286 now explicitly invokes the A2 within-year label-uniformity convention (Section III-G) when reading the left-tail share as a partner-level "minority of hand-signers." - m4 IV-F L177 cross-ref / fold distinction: corrected Section III-H → Section III-L anchor, and added explicit note that the 0.95 heuristic is a whole-sample anchor while Table XI thresholds are calibration-fold-derived (cosine P5 = 0.9407). - m5 Table XVI (30,222) vs Table XVII (30,226) Firm A count gap: results L406 now explains the 4-report difference (XVI restricts to both-signers-Firm-A single-firm two-signer reports; XVII counts at-least-one-Firm-A signer under the 84,386-document cohort). 
- m6 Methodology L156 "four independent quantitative analyses" actually enumerated 6 items: rephrased as "three primary independent quantitative analyses plus a fourth strand comprising three complementary checks." - m7 Abstract "cluster into three groups" restored the "smoothly-mixed" qualifier to match Discussion V-B and Conclusion L17. - Codex-caught residue at methodology L255 ("Median, 1st percentile, and 95th percentile of signature-level cosine/dHash distributions") grammatically applied P95 to cosine too. Rewrote as "cosine median, P1, and P5 (lower-tail) and dHash_indep median and P95 (upper-tail)" matching Table XI L233 exactly. No re-computation. All tables (IV-XVIII) and Appendix A numbers unchanged. Abstract at 249/250 words after smoothly-mixed qualifier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 21:21:37 +08:00
gbanyan	9b0b8358a2	Paper A v3.12: resolve Gemini 3.1 Pro round-11 full-paper review findings Round-11 Gemini 3.1 Pro fresh full-paper review (Minor Revision) surfaced four issues that the prior 10 rounds (codex gpt-5.4 x4, codex gpt-5.5 x1, Gemini 3.1 Pro x2, Opus 4.7 x1, paragraph-level v3.11 review) all missed: 1. MAJOR - Percentile-terminology contradiction between Section III-L L290 and Section III-H L160. III-L called 0.95 the "whole-sample Firm A P95" of the per-signature best-match cosine distribution, but III-H states 92.5% of Firm A signatures exceed 0.95. Under standard bottom-up percentile convention this makes 0.95 the P7.5, not the P95; Table XI calibration-fold data (Firm A cosine median = 0.9862, P5 = 0.9407) confirms true P95 is near 0.998. Fix: rewrote III-L L290 to state 0.95 corresponds to approximately the whole-sample Firm A P7.5 with the 92.5%/7.5% complement stated explicitly. dHash P95 claims elsewhere (Table XI, L229/L233) were already correct under standard convention and are unchanged. 2. MINOR - Firm A CPA count inconsistency. Discussion V-C L44 said "Nine additional Firm A CPAs are excluded from the GMM for having fewer than 10 signatures" but Results IV-G.2 L216 defines 178 valid Firm A CPAs (180 registry minus 2 disambiguation-excluded); 178 - 171 = 7. Fix: corrected to "seven are outside the GMM" with explicit 178-baseline and cross-reference to IV-G.2. 3. MINOR - Table XVI mixed-firm handling broken promise. Results L355-356 previously said "mixed-firm reports are reported separately" but Table XVI only lists single-firm rows summing to exactly 83,970, and no subsequent prose reports the 384 mixed-firm agreement rate. Fix: rewrote L355-356 to state Table XVI covers the 83,970 single-firm reports only and that the 384 mixed-firm reports (0.46%) are excluded because firm-level agreement is not well defined when the two signers are at different firms. 4. MINOR - Contribution-count structural inconsistency. 
Introduction enumerates seven contributions, Conclusion opens with "Our contributions are fourfold." Fix: rewrote the Conclusion lead to "The seven numbered contributions listed in Section I can be grouped into four broader methodological themes," making the grouping explicit. No re-computation. All tables (IV-XVIII) and Appendix A numbers unchanged. Abstract unchanged (still 248/250 words). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 20:10:20 +08:00
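The percentile-convention point in item 1 of this commit can be checked with a short sketch. It uses synthetic scores, not the paper's data, and `percentile_rank` is a hypothetical helper rather than anything from the repository's scripts. Under the bottom-up convention, a threshold exceeded by 92.5% of values sits at roughly the 7.5th percentile.

```python
def percentile_rank(values, threshold):
    """Bottom-up percentile rank (0-100): share of values <= threshold."""
    values = list(values)
    return 100.0 * sum(v <= threshold for v in values) / len(values)


def share_above(values, threshold):
    """Share (0-100) of values strictly greater than threshold."""
    values = list(values)
    return 100.0 * sum(v > threshold for v in values) / len(values)


# Synthetic illustration only: 1,000 scores, 925 of which lie above 0.95.
scores = [0.90] * 75 + [0.98] * 925

above = share_above(scores, 0.95)     # share exceeding the threshold
rank = percentile_rank(scores, 0.95)  # bottom-up percentile of the threshold
print(f"share above 0.95: {above:.1f}%, percentile rank of 0.95: P{rank:.1f}")
```

The two quantities are complements, which is why the fix spells out the 92.5%/7.5% pair instead of the label "P95".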
gbanyan	d2f8673a67	Paper A v3.11: reframe Section III-G unit hierarchy + propagate implications Rewrites Section III-G (Unit of Analysis and Summary Statistics) after self-review identified three logical issues in v3.10: 1. Ordering inversion: the three units are now ordered signature -> auditor-year -> accountant, with auditor-year as the principled middle unit under within-year assumptions and accountant as a deliberate cross-year pooling. 2. Oversold assumption: the old "within-auditor-year no-mixing identification assumption" is split into A1 (pair-detectability, weak statistical, cross-year scope matching the detector) and A2 (within-year label uniformity, interpretive convention). The arithmetic statistics reported in the paper do not require A2; A2 only underwrites interpretive readings (notably IV-H.1's partner-level "minority of hand-signers" framing). 3. Motivation-assumption mismatch: removed the "longitudinal behaviour of interest" framing and explicitly disclaimed across-year homogeneity. Accountant-level coordinates are now described as a pooled observed tendency rather than a time-invariant regime. Propagated implications across Introduction, Discussion, and Results: softened "tends to cluster into a dominant regime" and "directly quantifying the minority of hand-signers" to "pooled observed tendency" / "consistent with within-firm heterogeneity"; rewrote the Limitations fifth point (was "treats all signatures from a CPA as a single class"); added a seventh Limitation acknowledging the source-template edge case; added a per-signature best-match cross-year caveat to Section IV-H.2; softened IV-H.2's "direct consequence" to "consistent with"; reframed pixel-identity anchor as pair-level proof of image reuse (with source-template exception) rather than absolute signature-level positive. 
Process: self-review (9 findings) -> full-pass fixes -> codex gpt-5.5 xhigh round-10 verification (8 RESOLVED, 1 PARTIAL, 4 MINOR regression findings) -> regression fixes. No re-computation. All tables (IV-XVIII) and Appendix A numbers unchanged. Abstract at 248/250 words. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 19:52:45 +08:00
gbanyan	615059a2c1	Paper A v3.10: resolve Opus 4.7 round-9 paper-vs-Appendix-A contradiction Opus round-9 review (paper/opus_final_review_v3_9.md) dissented from Gemini round-7 Accept and aligned with codex round-8 Minor, but for a DIFFERENT issue all prior reviewers missed: the paper's main text in four locations flatly claimed the BD/McCrary accountant-level null "persists across the Appendix-A bin-width sweep", yet Appendix A Table A.I itself documents a significant accountant-level cosine transition at bin 0.005 with |Z_below|=3.23, |Z_above|=5.18 (both past 1.96) located at cosine 0.980 --- on the upper edge of our two threshold estimators' convergence band [0.973, 0.979]. This is a paper-to-appendix contradiction that a careful reviewer would catch in 30 seconds. BLOCKER B1: BD/McCrary accountant-level claim softened across all four locations to match what Appendix A Table A.I actually reports: - Results IV-D.1 (lines 85-86): rewritten to say the null is not rejected at 2/3 cosine bin widths and 2/3 dHash bin widths, with the one cosine transition at bin 0.005 sitting on the upper edge of the convergence band and the one dHash transition at |Z|=1.96. - Results IV-E Table VIII row (line 145): "no transition / no transition" changed to "0.980 at bin 0.005 only; null at 0.002, 0.010" / "3.0 at bin 1.0 only (|Z|=1.96); null at 0.2, 0.5". - Results IV-E line 130 (Third finding): "does not produce a significant transition (robust across bin-width sweep)" replaced with "largely null at the accountant level --- no significant transition at 2/3 cosine bin widths and 2/3 dHash bin widths, with the one cosine transition at bin 0.005 sitting at cosine 0.980 on the upper edge of the convergence band". - Results IV-E line 152 (Table VIII synthesis paragraph): matched reframing. - Discussion V-B (line 27): "does not produce a significant transition at the accountant level either" -> "largely null at the accountant level ... 
with the one cosine transition on the upper edge of the convergence band". - Conclusion (line 16): matched reframing with power caveat retained. MAJOR M1: Related Work L67 stale "well suited to detecting the boundary between two generative mechanisms" framing (residue from pre-demotion drafts) replaced with a local-density-discontinuity diagnostic framing that matches the rest of the paper and flags the signature-level bin-width sensitivity + accountant-level rarity as documented in Appendix A. MAJOR M2: Table XII orphaned in-text anchor --- Table XII is defined inside IV-G.3 but had no in-text "Table XII reports ..." pointer at its presentation location. Added a single sentence before the table comment. MINOR m1: Section IV-I.1 "4 of 30,000+ Firm A documents, 0.01%" replaced with the exact "4 of 30,226 Firm A documents, 0.013%". MINOR m2: Section IV-E "the two-dimensional two-component GMM" wording ambiguity (reader might confuse with the already-selected K*=3 GMM from BIC) replaced with explicit "a separately fit two-component 2D GMM (reported as a cross-check on the 1D accountant-level crossings)". MINOR m3: Section IV-D L59 "downstream all-pairs analyses (Tables XII, XVIII)" misnomer --- Table XII is per-signature classifier output not all-pairs; Table XVIII's all-pairs are over ~16M pairs not 168,740. Replaced with an accurate list: "same-CPA per-signature best-match analyses (Tables V and XII, and the Firm-A per-signature rows of Tables XIII and XVIII)". MINOR m4: Methodology III-H L156 "the validation role is played by ... the held-out Firm A fold" slightly overclaims what the held-out fold establishes (the fold-level rates differ by 1-5 pp with p<0.001). Parenthetical hedge added: "(which confirms the qualitative replication-dominated framing; fold-level rate differences are disclosed in Section IV-G.2)". 
Also add: - paper/opus_final_review_v3_9.md (Opus 4.7 max-effort review) - paper/gemini_review_v3_8.md (Gemini round-7 Accept verdict, was missing from prior commit) Abstract remains 243 words (under IEEE Access 250 limit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 15:25:04 +08:00
gbanyan	85cfefe49f	Paper A v3.9: resolve codex round-8 regressions (Table XV baseline + cross-refs) Codex round-8 (paper/codex_review_gpt54_v3_8.md) dissented from Gemini's Accept and gave Minor Revision because of two real numerical/consistency issues Gemini's round-7 review missed. This commit fixes both. Table XV per-year Firm A baseline-share column corrected - All 11 yearly values resynced to the authoritative reports/partner_ranking/partner_ranking_report.md (per-year Deloitte baseline share column): 2013: 26.2% -> 32.4% (largest error; codex's test case) 2014: 27.1% -> 27.8% 2015: 27.2% -> 27.7% 2016: 27.4% -> 26.2% 2017: 27.9% -> 27.2% 2018: 28.1% -> 26.5% 2019: 28.2% -> 27.0% 2020: 28.3% -> 27.7% 2021: 28.4% -> 28.7% 2022: 28.5% -> 28.3% 2023: 28.5% -> 27.4% - Codex independently verified that the prior 2013 value 26.2% was numerically impossible because the underlying JSON places 97 Firm A auditor-years in the 2013 top-50% bucket out of 324 total, so the full-year baseline must be at least 97/324 = 29.9%. - All other Table XV columns (N, Top-10% k, in top-10%, share) were already correct and unchanged. Broken cross-references from earlier renumbering repaired - Methodology III-E: "ablation study (Section IV-F)" pointer corrected to "Section IV-J"; the ablation is at Section IV-J line 412 in the current Results, while IV-F is now "Calibration Validation with Firm A". - Results Table XVIII note: "per-signature best-match values in Tables IV/VI (mean = 0.980)" is orphaned after earlier renumbering (Table IV is all-pairs distributional statistics; Table VI is accountant-level GMM model selection). Replaced with an explicit pointer to "Section IV-D and visualized in Table XIII (whole-sample Firm A best-match mean ~ 0.980)". Table XIII is the correct container of per-signature best-match mean statistics. All other Section IV-X cross-references in methodology / results / discussion were spot-checked and remain correct under the current section numbering. 
With these two surgical fixes, codex's round-8 ranked items (1) and (2) are cleared. Item (3) was the final DOCX packaging pass (author metadata fill-in, figure rendering, reference formatting) which is done manually at submission time and does not affect the markdown. Deferred items remain deferred: - Visual-inspection protocol details (codex round-5 item 4) - General reproducibility appendix (codex round-5 item 6) Both are defensible for first IEEE Access submission per codex round-8 assessment, since the manuscript no longer leans on visual inspection or BD/McCrary as decisive standalone evidence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 14:59:27 +08:00
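The impossibility argument for the old 2013 Table XV value in this commit is a one-line bound. The sketch below restates it, assuming (as the commit message implies) that 324 is the year's total auditor-year count and 97 is the number of Firm A auditor-years already counted in one bucket; the function name is hypothetical.

```python
def min_baseline_share(firm_count_in_bucket, total_auditor_years):
    """Lower bound on a firm's full-year baseline share.

    If a firm already has `firm_count_in_bucket` auditor-years inside one
    bucket, its share of all auditor-years cannot be smaller than that
    count divided by the year's total.
    """
    if total_auditor_years <= 0:
        raise ValueError("total_auditor_years must be positive")
    return firm_count_in_bucket / total_auditor_years


lower_bound = min_baseline_share(97, 324)  # counts quoted in the commit message
old_value = 0.262        # prior Table XV entry for 2013
corrected_value = 0.324  # resynced Table XV entry for 2013

print(f"lower bound: {lower_bound:.1%}")
print("old value feasible:", old_value >= lower_bound)
print("corrected value feasible:", corrected_value >= lower_bound)
```

The same check can be run for every year once the per-year bucket counts are loaded from the report JSON.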
gbanyan	fcce58aff0	Paper A v3.8: resolve Gemini 3.1 Pro round-6 independent-review findings Gemini round-6 (paper/gemini_review_v3_7.md) gave Minor Revision but flagged three issues that five rounds of codex review had missed. This commit addresses all three. BLOCKER: Accountant-level BD/McCrary null is a power artifact, not proof of smoothness (Gemini Issue 1) - At N=686 accountants the BD/McCrary test has limited statistical power; interpreting a failure-to-reject as affirmative proof of smoothness is a Type II error risk. - Discussion V-B: "itself diagnostic of smoothness" replaced with "failure-to-reject rather than a failure of the method --- informative alongside the other evidence but subject to the power caveat in Section V-G". - Discussion V-G (Sixth limitation): added a power-aware paragraph naming N=686 explicitly and clarifying that the substantive claim of smoothly-mixed clustering rests on the JOINT weight of dip test + BIC-selected GMM + BD null, not on BD alone. - Results IV-D.1 and IV-E: reframe accountant-level null as "consistent with --- not affirmative proof of" clustered-but-smoothly-mixed, citing V-G for the power caveat. - Appendix A interpretation paragraph: explicit inferential-asymmetry sentence ("consistency is what the BD null delivers, not affirmative proof"); "itself evidence for" removed. - Conclusion: "consistent with clustered but smoothly mixed" rephrased with explicit power caveat ("at N = 686 the test has limited power and cannot affirmatively establish smoothness"). MAJOR: Table X FRR / EER was tautological reviewer-bait (Gemini Issue 2) - Byte-identical positive anchor has cosine approx 1 by construction, so FRR against that subset is trivially 0 at every threshold below 1 and any EER calculation is arithmetic tautology, not biometric performance. 
- Results IV-G.1: removed EER row; dropped FRR column from Table X; added a table note explaining the omission and directing readers to Section V-F for the conservative-subset discussion. - Methodology III-K: removed the EER / FRR-against-byte-identical reporting clause; clarified that FAR against inter-CPA negatives is the primary reported quantity. - Table X is now FAR + Wilson 95% CI only, which is the quantity that actually carries empirical content on this anchor design. MINOR: Document-level worst-case aggregation narrative (Gemini Issue 3) + 15-signature delta (Gemini spot-check) - Results IV-I: added two sentences explicitly noting that the document-level percentages reflect the Section III-L worst-case aggregation rule (a report with one stamped + one hand-signed signature inherits the most-replication-consistent label), and cross-referencing Section IV-H.3 / Table XVI for the mixed-report composition that qualifies the headline percentages. - Results IV-D: added a one-sentence footnote explaining that the 15-signature delta between the Table III CPA-matched count (168,755) and the all-pairs analyzed count (168,740) is due to CPAs with exactly one signature, for whom no same-CPA pairwise best-match statistic exists. Abstract remains 243 words, comfortably under the IEEE Access 250-word cap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 14:47:48 +08:00
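Table X now reports FAR with a Wilson 95% CI. For readers unfamiliar with that interval, here is a minimal sketch of the standard Wilson score formula. The function name and the counts in the usage lines are illustrative assumptions, not taken from the repository's scripts or from the paper's tables.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion.

    `z` defaults to the two-sided 95% normal quantile. Returns (low, high),
    clipped to [0, 1].
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Illustrative counts only: 3 false accepts among 10,000 negative pairs.
low, high = wilson_interval(3, 10_000)
print(f"FAR = {3 / 10_000:.4%}, Wilson 95% CI = [{low:.4%}, {high:.4%}]")
```

Unlike the plain normal approximation, the Wilson interval stays inside [0, 1] and remains informative when the observed rate is at or near zero, which is the usual situation for a FAR against inter-CPA negatives.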
gbanyan	552b6b80d4	Paper A v3.7: demote BD/McCrary to density-smoothness diagnostic; add Appendix A Implements codex gpt-5.4 recommendation (paper/codex_bd_mccrary_opinion.md, "option (c) hybrid"): demote BD/McCrary in the main text from a co-equal threshold estimator to a density-smoothness diagnostic, and add a bin-width sensitivity appendix as an audit trail. Why: the bin-width sweep (Script 25) confirms that at the signature level the BD transition drifts monotonically with bin width (Firm A cosine: 0.987 -> 0.985 -> 0.980 -> 0.975 as bin width widens 0.003 -> 0.015; full-sample dHash transitions drift from 2 to 10 to 9 across bin widths 1 / 2 / 3) and Z statistics inflate superlinearly with bin width, both characteristic of a histogram-resolution artifact. At the accountant level the BD null is robust across the sweep. The paper's earlier "three methodologically distinct estimators" framing therefore could not be defended to an IEEE Access reviewer once the sweep was run. Added - signature_analysis/25_bd_mccrary_sensitivity.py: bin-width sweep across 6 variants (Firm A / full-sample / accountant-level, each cosine + dHash_indep) and 3-4 bin widths per variant. Reports Z_below, Z_above, p-values, and number of significant transitions per cell. Writes reports/bd_sensitivity/bd_sensitivity.{json,md}. - paper/paper_a_appendix_v3.md: new "Appendix A. BD/McCrary Bin-Width Sensitivity" with Table A.I (all 20 sensitivity cells) and interpretation linking the empirical pattern to the main-text framing decision. - export_v3.py: appendix inserted into SECTIONS between conclusion and references. - paper/codex_bd_mccrary_opinion.md: codex gpt-5.4 recommendation captured verbatim for audit trail. Main-text reframing - Abstract: "three methodologically distinct estimators" -> "two estimators plus a Burgstahler-Dichev/McCrary density-smoothness diagnostic". Trimmed to 243 words. 
- Introduction: related-work summary, pipeline step 5, accountant-level convergence sentence, contribution 4, and section-outline line all updated. Contribution 4 renamed to "Convergent threshold framework with a smoothness diagnostic". - Methodology III-I: section renamed to "Convergent Threshold Determination with a Density-Smoothness Diagnostic". "Method 2: BD/McCrary Discontinuity" converted to "Density-Smoothness Diagnostic" in a new subsection; Method 3 (Beta mixture) renumbered to Method 2. Subsections 4 and 5 updated to refer to "two threshold estimators" with BD as diagnostic. - Methodology III-A pipeline overview: "three methodologically distinct statistical methods" -> "two methodologically distinct threshold estimators complemented by a density-smoothness diagnostic". - Methodology III-L: "three-method analysis" -> "accountant-level threshold analysis (KDE antimode, Beta-2 crossing, logit-Gaussian robustness crossing)". - Results IV-D.1 heading: "BD/McCrary Discontinuity" -> "BD/McCrary Density-Smoothness Diagnostic". Prose now notes the Appendix-A bin-width instability explicitly. - Results IV-E: Table VIII restructured to label BD rows "(diagnostic only; bin-unstable)" and "(diagnostic; null across Appendix A)". Summary sentence rewritten to frame BD null as evidence for clustered-but-smoothly-mixed rather than as a convergence failure. Table cosine P5 row corrected from 0.941 to 0.9407 to match III-K. - Results IV-G.3 and IV-I.2: "three-method convergence/thresholds" -> "accountant-level convergent thresholds" (clarifies the 3 converging estimates are KDE antimode, Beta-2, logit-Gaussian, not KDE/BD/Beta). - Discussion V-B: "three-method framework" -> "convergent threshold framework". - Conclusion: "three methodologically distinct methods" -> "two threshold estimators and a density-smoothness diagnostic"; contribution 3 restated; future-work sentence updated. 
- Impact Statement (archived): "three methodologically distinct threshold-selection methods" -> "two methodologically distinct threshold estimators plus a density-smoothness diagnostic" so the archived text is internally consistent if reused. Discussion V-B / V-G already framed BD as a diagnostic in v3.5 (unchanged in this commit). The reframing therefore brings Abstract / Introduction / Methodology / Results / Conclusion into alignment with the Discussion framing that codex had already endorsed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 14:32:50 +08:00
gbanyan	6946baa096	Paper A v3.6: codex round-5 quick-wins cleanup (Minor Revision) Codex gpt-5.4 round-5 (codex_review_gpt54_v3_5.md) verdict was Minor Revision - all v3.4 round-4 PARTIAL/UNFIXED items now confirmed RESOLVED, including line-by-line recomputation of Table XI z/p matching the manuscript values. This commit cleans the remaining quick-win items: Table IX numerical sync to Script 24 authoritative values - Seven count corrections: cos>0.837 (60,405->60,408), cos>0.945 (57,131/94.52% -> 56,836/94.02%, was 295 sigs / 0.50 pp off), cos>0.973 (48,910/80.91% -> 48,028/79.45%, was 882 sigs / 1.46 pp off), cos>0.95 (55,916->55,922), dh<=8 (57,521->57,527), dh<=15 (60,345->60,348), dual (54,373->54,370). - Threshold label cos>0.941 -> cos>0.9407 (use exact calib-fold P5 rather than rounded value). - "dHash_indep <= 5 (calib-fold median-adjacent)" relabeled to "(whole-sample upper-tail of mode)" to match what III-L explains. - Added "(operational dual)" / "(style-consistency boundary)" labels for unambiguous mapping into III-L category definitions. - Removed circularity-language footnote inside the table comment. Circularity overclaim removed paper-wide - Methodology III-K (Section 3 anchor): "we break the resulting circularity" -> "we make the within-Firm-A sampling variance visible". - Results IV-G.2 subsection title: "(breaks calibration-validation circularity)" -> "(within-Firm-A sampling variance disclosure)". - Combined with the v3.5 Abstract / Conclusion edits, no surviving use of circular* anywhere in the paper. export_v3.py title page now single-anonymized - Removed "[Authors removed for double-blind review]" placeholder (IEEE Access uses single-anonymized review). - Replaced with explicit "[AUTHOR NAMES - fill in before submission]" + affiliation placeholder so the requirement is unmissable. - Subtitle now reads "single-anonymized review". 
III-G stale "cosine-conditional dHash" sentence removed - After the v3.5 III-L rewrite to dh_indep, the sentence at Methodology L131 referencing "cosine-conditional dHash used as a diagnostic elsewhere" no longer described any current paper usage. - Replaced with a positive statement that dh_indep is the dHash statistic used throughout the operational classifier and all reported capture-rate analyses. Abstract trimmed 247 -> 242 words for IEEE 250-word safety margin - "an end-to-end pipeline" -> "a pipeline"; "Unlike signature forgery" -> "Unlike forgery"; "we report" passive recast; small conjunction trims. Outstanding items deferred (require user decision / larger scope): - BD/McCrary either substantiate (Z/p table + bin-width robustness) or demote to supplementary diagnostic. - Visual-inspection protocol disclosure (sample size, rater count, blinding, adjudication rule). - Reproducibility appendix (VLM prompt, HSV thresholds, seeds, EM init / stopping / boundary handling). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 12:41:11 +08:00
gbanyan	12f716ddf1	Paper A v3.5: resolve codex round-4 residual issues Fully addresses the partial-resolution / unfixed items from codex gpt-5.4 round-4 review (codex_review_gpt54_v3_4.md): Critical - Table XI z/p columns now reproduce from displayed counts. Earlier table had 1-4-unit transcription errors in k values and a fabricated cos > 0.9407 calibration row; both fixed by rerunning Script 24 with cos = 0.9407 added to COS_RULES and copying exact values from the JSON output. - Section III-L classifier now defined entirely in terms of the independent-minimum dHash statistic that the deployed code (Scripts 21, 23, 24) actually uses; the legacy "cosine-conditional dHash" language is removed. Tables IX, XI, XII, XVI are now arithmetically consistent with the III-L classifier definition. - "0.95 not calibrated to Firm A" inconsistency reconciled: Section III-H now correctly says 0.95 is the whole-sample Firm A P95 of the per-signature cosine distribution, matching III-L and IV-F. Major - Abstract trimmed to 246 words (from 367) to meet IEEE Access 250-word limit. Removed "we break the circularity" overclaim; replaced with "report capture rates on both folds with Wilson 95% intervals to make fold-level variance visible". - Conclusion mirrors the Abstract reframe: 70/30 split documents within-firm sampling variance, not external generalization. - Introduction no longer promises precision / F1 / EER metrics that Methods/Results don't deliver; replaced with anchor-based capture / FAR + Wilson CI language. - Section III-G within-auditor-year empirical-check wording corrected: intra-report consistency (IV-H.3) is a different test (two co-signers on the same report, firm-level homogeneity) and is not a within-CPA year-level mixing check; the assumption is maintained as a bounded identification convention. 
- Section III-H "two analyses fully threshold-free" corrected to "only the partner-level ranking is threshold-free"; longitudinal-stability uses 0.95 cutoff, intra-report uses the operational classifier. Minor - Impact Statement removed from export_v3.py SECTIONS list (IEEE Access Regular Papers do not have a standalone Impact Statement). The file itself is retained as an archived non-paper note for cover-letter / grant-report reuse, with a clear archive header. - All 7 previously unused references ([27] dHash, [31][32] partner- signature mandates, [33] Taiwan partner rotation, [34] YOLO original, [35] VLM survey, [36] Mann-Whitney) are now cited in-text: [27] in Methodology III-E (dHash definition) [31][32][33] in Introduction (audit-quality regulation context) [34][35] in Methodology III-C/III-D [36] in Results IV-C (Mann-Whitney result) Updated Script 24 to include cos = 0.9407 in COS_RULES so Table XI's calibration-fold P5 row is computed from the same data file as the other rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 12:23:03 +08:00
gbanyan	0ff1845b22	Paper A v3.4: resolve codex round-3 major-revision blockers Three blockers from codex gpt-5.4 round-3 review (codex_review_gpt54_v3_3.md): B1 Classifier vs three-method threshold mismatch - Methodology III-L rewritten to make explicit that the per-signature classifier and the accountant-level three-method convergence operate at different units (signature vs accountant) and are complementary rather than substitutable. - Add Results IV-G.3 + Table XII operational-threshold sensitivity: cos>0.95 vs cos>0.945 shifts dual-rule capture by 1.19 pp on whole Firm A; ~5% of signatures flip at the Uncertain/Moderate boundary. B2 Held-out validation false "within Wilson CI" claim - Script 24 recomputes both calibration-fold and held-out-fold rates with Wilson 95% CIs and a two-proportion z-test on each rule. - Table XI replaced with the proper fold-vs-fold comparison; prose in Results IV-G.2 and Discussion V-C corrected: extreme rules agree across folds (p>0.7); operational rules in the 85-95% band differ by 1-5 pp due to within-Firm-A heterogeneity (random 30% sample contained more high-replication C1 accountants), not generalization failure. B3 Interview evidence reframed as practitioner knowledge - The Firm A "interviews" referenced throughout v3.3 are private, informal professional conversations, not structured research interviews. Reframed accordingly: all "interview*" references in abstract / intro / methodology / results / discussion / conclusion are replaced with "domain knowledge / industry-practice knowledge". - This avoids overclaiming methodological formality and removes the human-subjects research framing that triggered the ethics-statement requirement. - Section III-H four-pillar Firm A validation now stands on visual inspection, signature-level statistics, accountant-level GMM, and the three Section IV-H analyses, with practitioner knowledge as background context only. 
- New Section III-M ("Data Source and Firm Anonymization") covers MOPS public-data provenance, Firm A/B/C/D pseudonymization, and conflict-of-interest declaration. Add signature_analysis/24_validation_recalibration.py for the recomputed calib-vs-held-out z-tests and the classifier sensitivity analysis; output in reports/validation_recalibration/. Pending (not in this commit): abstract length (368 -> 250 words), Impact Statement removal, BD/McCrary sensitivity reporting, full reproducibility appendix, references cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 11:45:24 +08:00
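Note on the v3.4 entry above: the fold-vs-fold comparison it attributes to Script 24 (Wilson 95% intervals on each fold's capture rate, plus a two-proportion z-test per rule) could be sketched roughly as follows. This is an illustration only, not the contents of signature_analysis/24_validation_recalibration.py. The function names and the counts in the usage lines are hypothetical, and the pooled-variance form of the z-test is an assumption.

```python
import math


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k successes in n trials (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def two_proportion_z(k1, n1, k2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both folds are all-success or all-failure: no detectable difference.
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts for one rule on the two folds (not taken from the paper).
calib_k, calib_n = 940, 1000
held_k, held_n = 450, 500
print("calibration fold CI:", wilson_interval(calib_k, calib_n))
print("held-out fold CI:   ", wilson_interval(held_k, held_n))
print("z, p:               ", two_proportion_z(calib_k, calib_n, held_k, held_n))
```

Reporting the interval for each fold next to the z-test keeps the within-firm sampling variance visible, rather than asserting that one fold's rate falls inside the other's interval.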
gbanyan	5717d61dd4	Paper A v3.3: apply codex v3.2 peer-review fixes Codex (gpt-5.4) second-round review recommended 'minor revision'. This commit addresses all issues flagged in that review. ## Structural fixes - dHash calibration inconsistency (codex #1, most important): Clarified in Section III-L that the <=5 and <=15 dHash cutoffs come from the whole-sample Firm A cosine-conditional dHash distribution (median=5, P95=15), not from the calibration-fold independent-minimum dHash distribution (median=2, P95=9) which we report elsewhere as descriptive anchors. Added explicit note about the two dHash conventions and their relationship. - Section IV-H framing (codex #2): Renamed "Firm A Benchmark Validation: Threshold-Independent Evidence" to "Additional Firm A Benchmark Validation" and clarified in the section intro that H.1 uses a fixed 0.95 cutoff, H.2 is fully threshold-free, H.3 uses the calibrated classifier. H.3's concluding sentence now says "the substantive evidence lies in the cross-firm gap" rather than claiming the test is threshold-free. - Table XVI 93,979 typo fixed (codex #3): Corrected to 84,354 total (83,970 same-firm + 384 mixed-firm). - Held-out Firm A denominator 124+54=178 vs 180 (codex #4): Added explicit note that 2 CPAs were excluded due to disambiguation ties in the CPA registry. - Table VIII duplication (codex #5): Removed the duplicate accountant-level-only Table VIII comment; the comprehensive cross-level Table VIII subsumes it. Text now says "accountant-level rows of Table VIII (below)". - Anonymization broken in Tables XIV-XVI (codex #6): Replaced "Deloitte"/"KPMG"/"PwC"/"EY" with "Firm A"/"Firm B"/"Firm C"/ "Firm D" across Tables XIV, XV, XVI. Table and caption language updated accordingly. - Table X unit mismatch (codex #7): Dropped precision, recall, F1 columns. Table now reports FAR (against the inter-CPA negative anchor) with Wilson 95% CIs and FRR (against the byte-identical positive anchor). 
III-K and IV-G.1 text updated to justify the change. ## Sentence-level fixes - "three independent statistical methods" in Methodology III-A -> "three methodologically distinct statistical methods". - "three independent methods" in Conclusion -> "three methodologically distinct methods". - Abstract "~0.006 converging" now explicitly acknowledges that BD/McCrary produces no significant accountant-level discontinuity. - Conclusion ditto. - Discussion limitation sentence "BD/McCrary should be interpreted at the accountant level for threshold-setting purposes" rewritten to reflect v3.3 result that BD/McCrary is a diagnostic, not a threshold estimator, at the accountant level. - III-H "two analyses" -> "three analyses" (H.1 longitudinal stability, H.2 partner ranking, H.3 intra-report consistency). - Related Work White 1982 overclaim rewritten: "consistent estimators of the pseudo-true parameter that minimizes KL divergence" replaces "guarantees asymptotic recovery". - III-J "behavior is close to discrete" -> "practice is clustered". - IV-D.2 pivot sentence "discreteness of individual behavior yields bimodality" -> "aggregation over signatures reveals clustered (though not sharply discrete) patterns". Target journal remains IEEE Access. Output: Paper_A_IEEE_Access_Draft_v3.docx (395 KB). Codex v3.2 review saved to paper/codex_review_gpt54_v3_2.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 02:32:17 +08:00
gbanyan	51d15b32a5	Paper A v3.2: partner v4 feedback integration (threshold-independent benchmark validation) Partner v4 (signature_paper_draft_v4) proposed 3 substantive improvements; partner confirmed the 2013-2019 restriction was an error (sample stays 2013-2023). The remaining suggestions are adopted with our own data. ## New scripts - Script 22 (partner ranking): ranks all Big-4 auditor-years by mean max-cosine. Firm A occupies 95.9% of top-10% (base 27.8%), 3.5x concentration ratio. Stable across 2013-2023 (88-100% per year). - Script 23 (intra-report consistency): for each 2-signer report, classify both signatures and check agreement. Firm A agrees 89.9% vs 62-67% at other Big-4. 87.5% Firm A reports have BOTH signers non-hand-signed; only 4 reports (0.01%) both hand-signed. ## New methodology additions - III-G: explicit within-auditor-year no-mixing identification assumption (supported by Firm A interview evidence). - III-H: 4th Firm A validation line: threshold-independent evidence from partner ranking + intra-report consistency. ## New results section IV-H (threshold-independent validation) - IV-H.1: Firm A year-by-year cosine<0.95 rate. 2013-2019 mean=8.26%, 2020-2023 mean=6.96%, 2023 lowest (3.75%). Stability contradicts partner's hypothesis that 2020+ electronic systems increase heterogeneity -- data shows opposite (electronic systems more consistent than physical stamping). - IV-H.2: partner ranking top-K tables (pooled + year-by-year). - IV-H.3: intra-report consistency per-firm table. ## Renumbering - Section H (was Classification Results) -> I - Section I (was Ablation) -> J - Tables XIII-XVI new (yearly stability, top-K pooled, top-10% per-year, intra-report), XVII = classification (was XII), XVIII = ablation (was XIII). These threshold-independent analyses address the codex review concern about circular validation by providing benchmark evidence that does not depend on any threshold calibrated to Firm A itself. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 01:59:49 +08:00
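Note on the v3.2 entry above: the partner-ranking statistic it describes (rank all auditor-years by a score, take the top 10%, compare one firm's share there with its base share) could be computed along these lines. This is a hedged sketch rather than Script 22 itself. The `(firm, score)` record layout, the function name, and the toy data are assumptions made for illustration.

```python
def top_fraction_concentration(records, firm, top_frac=0.10):
    """records: list of (firm_label, score) pairs, one per auditor-year.

    Returns (share of `firm` in the top fraction, base share, ratio).
    """
    if not records:
        raise ValueError("records must be non-empty")
    # Highest scores first; sorted() is stable, so ties keep input order.
    ranked = sorted(records, key=lambda r: r[1], reverse=True)
    k = max(1, int(len(ranked) * top_frac))
    top = ranked[:k]
    share_top = sum(1 for f, _ in top if f == firm) / k
    share_base = sum(1 for f, _ in ranked if f == firm) / len(ranked)
    ratio = share_top / share_base if share_base > 0 else float("nan")
    return share_top, share_base, ratio


# Toy data (hypothetical): 5 "Firm A" auditor-years out of 20, two of them top-ranked.
toy = [("Firm A", 0.99), ("Firm A", 0.98)]
toy += [("Firm B", 0.90 - 0.01 * i) for i in range(15)]
toy += [("Firm A", 0.50), ("Firm A", 0.49), ("Firm A", 0.48)]
print(top_fraction_concentration(toy, "Firm A"))
```

With the figures quoted in the commit message, the same ratio is 0.959 / 0.278, or about 3.45, consistent with the rounded 3.5x.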
gbanyan	9d19ca5a31	Paper A v3.1: apply codex peer-review fixes + add Scripts 20/21 Major fixes per codex (gpt-5.4) review: ## Structural fixes - Fixed three-method convergence overclaim: added Script 20 to run KDE antimode, BD/McCrary, and Beta mixture EM on accountant-level means. Accountant-level 1D convergence: KDE antimode=0.973, Beta-2=0.979, LogGMM-2=0.976 (within ~0.006). BD/McCrary finds no transition at accountant level (consistent with smooth clustering, not sharp discontinuity). - Disambiguated Method 1: KDE crossover (between two labeled distributions, used at signature all-pairs level) vs KDE antimode (single-distribution local minimum, used at accountant level). - Addressed Firm A circular validation: Script 21 adds CPA-level 70/30 held-out fold. Calibration thresholds derived from 70% only; heldout rates reported with Wilson 95% CIs (e.g. cos>0.95 heldout=93.61% [93.21%-93.98%]). - Fixed 139+32 vs 180: the split is 139/32 of 171 Firm A CPAs with >=10 signatures (9 CPAs excluded for insufficient sample). Reconciled across intro, results, discussion, conclusion. - Added document-level classification aggregation rule (worst-case signature label determines document label). ## Pixel-identity validation strengthened - Script 21: built ~50,000-pair inter-CPA random negative anchor (replaces the original n=35 same-CPA low-similarity negative which had untenable Wilson CIs). - Added Wilson 95% CI for every FAR in Table X. - Proper EER interpolation (FAR=FRR point) in Table X. - Softened "conservative recall" claim to "non-generalizable subset" language per codex feedback (byte-identical positives are a subset, not a representative positive class). - Added inter-CPA stats: mean=0.762, P95=0.884, P99=0.913. ## Terminology & sentence-level fixes - "statistically independent methods" -> "methodologically distinct methods" throughout (three diagnostics on the same sample are not independent). 
- "formal bimodality check" -> "unimodality test" (dip test tests H0 of unimodality; rejection is consistent with but not a direct test of bimodality). - "Firm A near-universally non-hand-signed" -> already corrected to "replication-dominated" in prior commit; this commit strengthens that framing with explicit held-out validation. - "discrete-behavior regimes" -> "clustered accountant-level heterogeneity" (BD/McCrary non-transition at accountant level rules out sharp discrete boundaries; the defensible claim is clustered-but-smooth). - Softened White 1982 quasi-MLE claim (no longer framed as a guarantee). - Fixed VLM 1.2% FP overclaim (now acknowledges the 1.2% could be VLM FP or YOLO FN). - Unified "310 byte-identical signatures" language across Abstract, Results, Discussion (previously alternated between pairs/signatures). - Defined min_dhash_independent explicitly in Section III-G. - Fixed table numbering (Table XI heldout added, classification moved to XII, ablation to XIII). - Explained 84,386 vs 85,042 gap (656 docs have only one signature, no pairwise stat). - Made Table IX explicitly a "consistency check" not "validation"; paired it with Table XI held-out rates as the genuine external check. - Defined 0.941 threshold (calibration-fold Firm A cosine P5). - Computed 0.945 Firm A rate exactly (94.52%) instead of interpolated. - Fixed Ref [24] Qwen2.5-VL to full IEEE format (arXiv:2502.13923). ## New artifacts - Script 20: accountant-level three-method threshold analysis - Script 21: expanded validation (inter-CPA anchor, held-out Firm A 70/30) - paper/codex_review_gpt54_v3.md: preserved review feedback Output: Paper_A_IEEE_Access_Draft_v3.docx (391 KB, rebuilt from v3.1 markdown sources). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 01:11:51 +08:00
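Note on the v3.1 entry above: the document-level aggregation rule it mentions (the worst-case signature label determines the document label) could look like the sketch below. The label names and their severity ordering are hypothetical placeholders, not the paper's category definitions. The sketch also assumes that "worst case" means the label carrying the strongest evidence of replication.

```python
# Hypothetical severity ordering: a higher value means stronger replication evidence.
SEVERITY = {
    "hand_signed": 0,
    "uncertain": 1,
    "non_hand_signed": 2,
}


def document_label(signature_labels):
    """Return the most severe label among a document's signature labels."""
    if not signature_labels:
        raise ValueError("a document needs at least one classified signature")
    return max(signature_labels, key=SEVERITY.__getitem__)


# A two-signer report takes the label of its more severe signature.
print(document_label(["hand_signed", "non_hand_signed"]))
print(document_label(["uncertain", "hand_signed"]))
```

Under this rule a document's label is assigned directly from its signature labels, so any signature-level classifier error carries over to the document level.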
gbanyan	9b11f03548	Paper A v3: full rewrite for IEEE Access with three-method convergence Major changes from v2: Terminology: - "digitally replicated" -> "non-hand-signed" throughout (per partner v3 feedback and to avoid implicit accusation) - "Firm A near-universal non-hand-signing" -> "replication-dominated" (per interview nuance: most but not all Firm A partners use replication) Target journal: IEEE TAI -> IEEE Access (per NCKU CSIE list) New methodological sections (III.G-III.L + IV.D-IV.G): - Three convergent threshold methods (KDE antimode + Hartigan dip test / Burgstahler-Dichev McCrary / EM-fitted Beta mixture + logit-GMM robustness check) - Explicit unit-of-analysis discussion (signature vs accountant) - Accountant-level 2D Gaussian mixture (BIC-best K=3 found empirically) - Pixel-identity validation anchor (no manual annotation needed) - Low-similarity negative anchor + Firm A replication-dominated anchor New empirical findings integrated: - Firm A signature cosine UNIMODAL (dip p=0.17) - long left tail = minority hand-signers - Full-sample cosine MULTIMODAL but not cleanly bimodal (BIC prefers 3-comp mixture) - signature-level is continuous quality spectrum - Accountant-level mixture trimodal (C1 Deloitte-heavy 139/141, C2 other Big-4, C3 smaller firms). 2-comp crossings cos=0.945, dh=8.10 - Pixel-identity anchor (310 pairs) gives perfect recall at all cosine thresholds - Firm A anchor rates: cos>0.95=92.5%, dual-rule cos>0.95 AND dh<=8=89.95% New discussion section V.B: "Continuous-quality spectrum vs discrete- behavior regimes" - the core interpretive contribution of v3. References added: Hartigan & Hartigan 1985, Burgstahler & Dichev 1997, McCrary 2008, Dempster-Laird-Rubin 1977, White 1982 (refs 37-41). export_v3.py builds Paper_A_IEEE_Access_Draft_v3.docx (462 KB, +40% vs v2 from expanded methodology + results sections). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 00:14:47 +08:00

16 Commits