0471e36fd45a2cd956912b3e9e4db4f4cf328b38
8 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
0471e36fd4 |
Paper A v3.16: remove unsupported visual-inspection / sanity-sample claims
User review of the v3.15 Sanity Sample subsection revealed that the paper's claim of "inter-rater agreement with the classifier in all 30 cases" (Results IV-G.4) was not backed by any data artifact in the repository. Script 19 exports a 30-signature stratified sample to reports/pixel_validation/sanity_sample.csv, but that CSV contains only classifier output fields (stratum, sig_id, cosine, dhash_indep, pixel_identical, closest_match) and no human-annotation column, and no subsequent script computes any human-classifier agreement metric. User confirmed that the only human annotation in the project was the YOLO training-set bounding-box labeling; signature classification (stamped vs hand-signed) was done entirely by automated numerical methods. The 30/30 sanity-sample claim was therefore factually unsupported and has been removed.
Investigation additionally revealed that the "independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images...for many of the sampled partners" framing used as the first strand of Firm A's replication-dominated evidence (Section III-H first strand, Section V-C first strand, and the Conclusion fourth contribution) had the same provenance problem: no human visual inspection was performed. The underlying FACT (that Firm A contains many byte-identical same-CPA signature pairs) is correct and fully supported by automated byte-level pair analysis (Script 19), but the "visual inspection" phrasing misrepresents the provenance.
Changes:
1. Results IV-G.4 "Sanity Sample" subsection deleted entirely (results_v3.md L271-273).
2. Methodology III-K penultimate paragraph describing the 30-signature manual visual sanity inspection deleted (methodology_v3.md L259).
3. Methodology Section III-H first strand (L152) rewritten from "independent visual inspection of randomly sampled Firm A reports reveals pixel-identical signature images...for many of the sampled partners" to "automated byte-level pair analysis (Section IV-G.1) identifies 145 Firm A signatures that are byte-identical to at least one other same-CPA signature from a different audit report, distributed across 50 distinct Firm A partners (of 180 registered); 35 of these byte-identical matches span different fiscal years." All four numbers verified directly from the signature_analysis.db database via pixel_identical_to_closest = 1 filter joined to accountants.firm.
4. Discussion V-C first strand (L41) rewritten analogously to refer to byte-level pair evidence with the same four verified numbers.
5. Conclusion fourth contribution (L21) rewritten to "byte-level pair analysis finding of 145 pixel-identical calibration-firm signatures across 50 distinct partners (Section IV-G.1)."
6. Abstract (L5): "visual inspection and accountant-level mixture evidence..." rewritten as "byte-level pixel-identity evidence (145 signatures across 50 partners) and accountant-level mixture evidence..." Abstract now at 250/250 words.
7. Introduction (L55): "visual-inspection evidence" relabeled "byte-level pixel-identity evidence" for internal consistency.
8. Methodology III-H penultimate (L164): "validation role is played by the visual inspection" relabeled "validation role is played by the byte-level pixel-identity evidence" for consistency.
All substantive claims are preserved and now back-traceable to Script 19 output and the signature_analysis.db pixel_identical_to_closest flag. This correction brings the paper's descriptive language into strict alignment with its actual methodology, which is fully automated (except for YOLO training annotation, disclosed in Methodology Section III-B).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
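The database check described in the commit above (a `pixel_identical_to_closest = 1` filter joined to `accountants.firm`) could be sketched roughly as follows. Only that flag, `accountants.firm`, `sig_id`, and `closest_match` are named in the commit message; the `signatures` table name, the `cpa_id` and `fiscal_year` columns, the assumption that `closest_match` stores a `sig_id`, and the toy rows are all hypothetical, so this is an illustration and not the query actually run against signature_analysis.db.

```python
import sqlite3


def build_toy_db():
    """In-memory stand-in for signature_analysis.db (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE accountants (cpa_id INTEGER PRIMARY KEY, firm TEXT);
        CREATE TABLE signatures (
            sig_id INTEGER PRIMARY KEY,
            cpa_id INTEGER,
            fiscal_year INTEGER,
            pixel_identical_to_closest INTEGER,
            closest_match INTEGER          -- assumed to hold another sig_id
        );
        INSERT INTO accountants VALUES (1, 'Firm A'), (2, 'Firm A'), (3, 'Firm B');
        INSERT INTO signatures VALUES
            (10, 1, 2019, 1, 11),
            (11, 1, 2020, 1, 10),
            (12, 2, 2020, 0, NULL),
            (13, 3, 2020, 1, 14),
            (14, 3, 2020, 1, 13);
    """)
    return conn


def byte_identity_summary(conn, firm):
    """Count flagged signatures, distinct partners, and cross-year matches for one firm."""
    row = conn.execute(
        """
        SELECT COUNT(*),
               COUNT(DISTINCT s.cpa_id),
               COALESCE(SUM(s.fiscal_year <> m.fiscal_year), 0)
        FROM signatures AS s
        JOIN accountants AS a ON a.cpa_id = s.cpa_id
        LEFT JOIN signatures AS m ON m.sig_id = s.closest_match
        WHERE s.pixel_identical_to_closest = 1 AND a.firm = ?
        """,
        (firm,),
    ).fetchone()
    return {"signatures": row[0], "partners": row[1], "cross_year": row[2]}


if __name__ == "__main__":
    print(byte_identity_summary(build_toy_db(), "Firm A"))
```

Keeping the three counts in one query means they are derived from the same filtered row set, which is the property the commit relies on when it says the numbers are back-traceable to a single flag.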
|
|
ef0e417257 |
Paper A v3.13: resolve Opus 4.7 round-12 + codex gpt-5.5 round-13 findings
Opus 4.7 max-effort round-12 on v3.12 found 1 MAJOR + 7 MINOR residues;
codex gpt-5.5 xhigh round-13 cross-verified 11/11 RESOLVED and caught
one additional cosine-P95 ambiguity Opus missed (methodology L255).
Total 12 text-only edits across 5 files.
MAJOR M1 - Cosine P95→P7.5 terminology residue at two sites that cite
the v3.12-corrected Section III-L but still wrote "P95" (self-
contradiction). Fix: methodology L165 and results L247 both restated
as "whole-sample Firm A P7.5 heuristic" with the 92.5%/7.5%
complement spelled out.
MINOR findings and fixes:
- m1 Big-4 scope slip: methodology III-H(b) L166 and results IV-H.2
L311 said "every Big-4 auditor-year" but IV-H.2 ranking actually
pools all 4,629 auditor-years across Big-4 and Non-Big-4. Both
sites now say "every auditor-year ... across all firms."
- m2 178 vs 180 Firm A CPA breakdown: intro L54 and conclusion L21
now add "of 180 registered CPAs; 178 after excluding two with
disambiguation ties, Section IV-G.2" parenthetical to avoid the
misleading 180−171=9 reading.
- m3 IV-H.1 A2 citation: results L286 now explicitly invokes the
A2 within-year label-uniformity convention (Section III-G) when
reading the left-tail share as a partner-level "minority of hand-
signers."
- m4 IV-F L177 cross-ref / fold distinction: corrected Section III-H
→ Section III-L anchor, and added explicit note that the 0.95
heuristic is a whole-sample anchor while Table XI thresholds are
calibration-fold-derived (cosine P5 = 0.9407).
- m5 Table XVI (30,222) vs Table XVII (30,226) Firm A count gap:
results L406 now explains the 4-report difference (XVI restricts
to both-signers-Firm-A single-firm two-signer reports; XVII counts
at-least-one-Firm-A signer under the 84,386-document cohort).
- m6 Methodology L156 "four independent quantitative analyses"
actually enumerated 6 items: rephrased as "three primary
independent quantitative analyses plus a fourth strand comprising
three complementary checks."
- m7 Abstract "cluster into three groups" restored the "smoothly-
mixed" qualifier to match Discussion V-B and Conclusion L17.
- Codex-caught residue at methodology L255 ("Median, 1st percentile,
and 95th percentile of signature-level cosine/dHash distributions")
grammatically applied P95 to cosine too. Rewrote as
"cosine median, P1, and P5 (lower-tail) and dHash_indep median
and P95 (upper-tail)" matching Table XI L233 exactly.
No re-computation. All tables (IV-XVIII) and Appendix A numbers
unchanged. Abstract at 249/250 words after smoothly-mixed qualifier.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
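The P95 vs P7.5 terminology fix in the commit above rests on a simple complement: if 92.5% of Firm A signatures have cosine above 0.95, then 0.95 sits at roughly the 7.5th percentile of that distribution, not the 95th. A minimal sketch with made-up values (the 40 toy cosines and the function names are illustrative assumptions, not project data):

```python
def share_above(values, threshold):
    """Fraction of values strictly greater than the threshold."""
    return sum(v > threshold for v in values) / len(values)


def share_at_or_below(values, threshold):
    """Complementary fraction: values less than or equal to the threshold."""
    return sum(v <= threshold for v in values) / len(values)


# Toy sample: 3 of 40 values lie at or below 0.95, the other 37 lie above it.
toy_cosines = [0.90, 0.93, 0.95] + [0.96 + i * 0.001 for i in range(37)]

if __name__ == "__main__":
    above = share_above(toy_cosines, 0.95)
    below = share_at_or_below(toy_cosines, 0.95)
    # The two shares always sum to 1, so a 92.5% capture rate implies a P7.5 cutoff.
    print(f"above={above:.3f} at_or_below={below:.3f}")
```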
|
|
d2f8673a67 |
Paper A v3.11: reframe Section III-G unit hierarchy + propagate implications
Rewrites Section III-G (Unit of Analysis and Summary Statistics) after self-review identified three logical issues in v3.10:
1. Ordering inversion: the three units are now ordered signature -> auditor-year -> accountant, with auditor-year as the principled middle unit under within-year assumptions and accountant as a deliberate cross-year pooling.
2. Oversold assumption: the old "within-auditor-year no-mixing identification assumption" is split into A1 (pair-detectability, weak statistical, cross-year scope matching the detector) and A2 (within-year label uniformity, interpretive convention). The arithmetic statistics reported in the paper do not require A2; A2 only underwrites interpretive readings (notably IV-H.1's partner-level "minority of hand-signers" framing).
3. Motivation-assumption mismatch: removed the "longitudinal behaviour of interest" framing and explicitly disclaimed across-year homogeneity. Accountant-level coordinates are now described as a pooled observed tendency rather than a time-invariant regime.
Propagated implications across Introduction, Discussion, and Results:
- softened "tends to cluster into a dominant regime" and "directly quantifying the minority of hand-signers" to "pooled observed tendency" / "consistent with within-firm heterogeneity";
- rewrote the Limitations fifth point (was "treats all signatures from a CPA as a single class");
- added a seventh Limitation acknowledging the source-template edge case;
- added a per-signature best-match cross-year caveat to Section IV-H.2;
- softened IV-H.2's "direct consequence" to "consistent with";
- reframed pixel-identity anchor as pair-level proof of image reuse (with source-template exception) rather than absolute signature-level positive.
Process: self-review (9 findings) -> full-pass fixes -> codex gpt-5.5 xhigh round-10 verification (8 RESOLVED, 1 PARTIAL, 4 MINOR regression findings) -> regression fixes.
No re-computation. All tables (IV-XVIII) and Appendix A numbers unchanged. Abstract at 248/250 words.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
552b6b80d4 |
Paper A v3.7: demote BD/McCrary to density-smoothness diagnostic; add Appendix A
Implements codex gpt-5.4 recommendation (paper/codex_bd_mccrary_opinion.md,
"option (c) hybrid"): demote BD/McCrary in the main text from a co-equal
threshold estimator to a density-smoothness diagnostic, and add a
bin-width sensitivity appendix as an audit trail.
Why: the bin-width sweep (Script 25) confirms that at the signature
level the BD transition drifts monotonically with bin width (Firm A
cosine: 0.987 -> 0.985 -> 0.980 -> 0.975 as bin width widens 0.003 ->
0.015; full-sample dHash transitions drift from 2 to 10 to 9 across
bin widths 1 / 2 / 3) and Z statistics inflate superlinearly with bin
width, both characteristic of a histogram-resolution artifact. At the
accountant level the BD null is robust across the sweep. The paper's
earlier "three methodologically distinct estimators" framing therefore
could not be defended to an IEEE Access reviewer once the sweep was
run.
Added
- signature_analysis/25_bd_mccrary_sensitivity.py: bin-width sweep
across 6 variants (Firm A / full-sample / accountant-level, each
cosine + dHash_indep) and 3-4 bin widths per variant. Reports
Z_below, Z_above, p-values, and number of significant transitions
per cell. Writes reports/bd_sensitivity/bd_sensitivity.{json,md}.
- paper/paper_a_appendix_v3.md: new "Appendix A. BD/McCrary Bin-Width
Sensitivity" with Table A.I (all 20 sensitivity cells) and
interpretation linking the empirical pattern to the main-text
framing decision.
- export_v3.py: appendix inserted into SECTIONS between conclusion
and references.
- paper/codex_bd_mccrary_opinion.md: codex gpt-5.4 recommendation
captured verbatim for audit trail.
Main-text reframing
- Abstract: "three methodologically distinct estimators" ->
"two estimators plus a Burgstahler-Dichev/McCrary density-
smoothness diagnostic". Trimmed to 243 words.
- Introduction: related-work summary, pipeline step 5, accountant-
level convergence sentence, contribution 4, and section-outline
line all updated. Contribution 4 renamed to "Convergent threshold
framework with a smoothness diagnostic".
- Methodology III-I: section renamed to "Convergent Threshold
Determination with a Density-Smoothness Diagnostic". "Method 2:
BD/McCrary Discontinuity" converted to "Density-Smoothness
Diagnostic" in a new subsection; Method 3 (Beta mixture) renumbered
to Method 2. Subsections 4 and 5 updated to refer to "two threshold
estimators" with BD as diagnostic.
- Methodology III-A pipeline overview: "three methodologically
distinct statistical methods" -> "two methodologically distinct
threshold estimators complemented by a density-smoothness
diagnostic".
- Methodology III-L: "three-method analysis" -> "accountant-level
threshold analysis (KDE antimode, Beta-2 crossing, logit-Gaussian
robustness crossing)".
- Results IV-D.1 heading: "BD/McCrary Discontinuity" ->
"BD/McCrary Density-Smoothness Diagnostic". Prose now notes the
Appendix-A bin-width instability explicitly.
- Results IV-E: Table VIII restructured to label BD rows
"(diagnostic only; bin-unstable)" and "(diagnostic; null across
Appendix A)". Summary sentence rewritten to frame BD null as
evidence for clustered-but-smoothly-mixed rather than as a
convergence failure. Table cosine P5 row corrected from 0.941 to
0.9407 to match III-K.
- Results IV-G.3 and IV-I.2: "three-method convergence/thresholds"
-> "accountant-level convergent thresholds" (clarifies the 3
converging estimates are KDE antimode, Beta-2, logit-Gaussian,
not KDE/BD/Beta).
- Discussion V-B: "three-method framework" -> "convergent threshold
framework".
- Conclusion: "three methodologically distinct methods" -> "two
threshold estimators and a density-smoothness diagnostic";
contribution 3 restated; future-work sentence updated.
- Impact Statement (archived): "three methodologically distinct
threshold-selection methods" -> "two methodologically distinct
threshold estimators plus a density-smoothness diagnostic" so the
archived text is internally consistent if reused.
Discussion V-B / V-G already framed BD as a diagnostic in v3.5
(unchanged in this commit). The reframing therefore brings Abstract /
Introduction / Methodology / Results / Conclusion into alignment with
the Discussion framing that codex had already endorsed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
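The bin-width argument in the commit above can be illustrated with a small standard-library sketch. It uses the commonly cited Burgstahler-Dichev standardized difference (observed bin count minus the mean of its two neighbours, divided by an estimated standard deviation). Whether Script 25 uses exactly this variance term is an assumption, and `histogram`, `bd_z`, and `sweep` are hypothetical names, not functions from the repository.

```python
import math


def histogram(values, lo, hi, width):
    """Count values into equal-width bins covering [lo, hi)."""
    n_bins = int(round((hi - lo) / width))
    counts = [0] * n_bins
    for v in values:
        if lo <= v < hi:
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return counts


def bd_z(counts, i):
    """Standardized difference for interior bin i against its two neighbours."""
    if not 1 <= i <= len(counts) - 2:
        raise ValueError("bin i must have a neighbour on each side")
    n = sum(counts)
    p_i = counts[i] / n
    p_adj = (counts[i - 1] + counts[i + 1]) / n
    expected = (counts[i - 1] + counts[i + 1]) / 2
    # Assumed variance form: N*p_i*(1-p_i) + (1/4)*N*p_adj*(1-p_adj)
    var = n * p_i * (1 - p_i) + 0.25 * n * p_adj * (1 - p_adj)
    return (counts[i] - expected) / math.sqrt(var) if var > 0 else 0.0


def sweep(values, lo, hi, widths):
    """For each bin width, report the left edge and |Z| of the most extreme interior bin."""
    out = {}
    for w in widths:
        counts = histogram(values, lo, hi, w)
        z, i = max((abs(bd_z(counts, i)), i) for i in range(1, len(counts) - 1))
        out[w] = (lo + i * w, z)
    return out
```

If the reported edge moves steadily as `widths` changes, as the commit describes for the signature-level data, that drift points to a histogram-resolution effect and not to a stable discontinuity in the underlying density.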
|
|
12f716ddf1 |
Paper A v3.5: resolve codex round-4 residual issues
Fully addresses the partial-resolution / unfixed items from codex
gpt-5.4 round-4 review (codex_review_gpt54_v3_4.md):
Critical
- Table XI z/p columns now reproduce from displayed counts. Earlier
table had 1-4-unit transcription errors in k values and a fabricated
cos > 0.9407 calibration row; both fixed by rerunning Script 24
with cos = 0.9407 added to COS_RULES and copying exact values from
the JSON output.
- Section III-L classifier now defined entirely in terms of the
independent-minimum dHash statistic that the deployed code (Scripts
21, 23, 24) actually uses; the legacy "cosine-conditional dHash"
language is removed. Tables IX, XI, XII, XVI are now arithmetically
consistent with the III-L classifier definition.
- "0.95 not calibrated to Firm A" inconsistency reconciled: Section
III-H now correctly says 0.95 is the whole-sample Firm A P95 of the
per-signature cosine distribution, matching III-L and IV-F.
Major
- Abstract trimmed to 246 words (from 367) to meet IEEE Access 250-word
limit. Removed "we break the circularity" overclaim; replaced with
"report capture rates on both folds with Wilson 95% intervals to
make fold-level variance visible".
- Conclusion mirrors the Abstract reframe: 70/30 split documents
within-firm sampling variance, not external generalization.
- Introduction no longer promises precision / F1 / EER metrics that
Methods/Results don't deliver; replaced with anchor-based capture /
FAR + Wilson CI language.
- Section III-G within-auditor-year empirical-check wording corrected:
intra-report consistency (IV-H.3) is a different test (two co-signers
on the same report, firm-level homogeneity) and is not a within-CPA
year-level mixing check; the assumption is maintained as a bounded
identification convention.
- Section III-H "two analyses fully threshold-free" corrected to "only
the partner-level ranking is threshold-free"; longitudinal-stability
uses 0.95 cutoff, intra-report uses the operational classifier.
Minor
- Impact Statement removed from export_v3.py SECTIONS list (IEEE Access
Regular Papers do not have a standalone Impact Statement). The file
itself is retained as an archived non-paper note for cover-letter /
grant-report reuse, with a clear archive header.
- All 7 previously unused references ([27] dHash, [31][32] partner-
signature mandates, [33] Taiwan partner rotation, [34] YOLO original,
[35] VLM survey, [36] Mann-Whitney) are now cited in-text:
[27] in Methodology III-E (dHash definition)
[31][32][33] in Introduction (audit-quality regulation context)
[34][35] in Methodology III-C/III-D
[36] in Results IV-C (Mann-Whitney result)
Updated Script 24 to include cos = 0.9407 in COS_RULES so Table XI's
calibration-fold P5 row is computed from the same data file as the
other rows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
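The requirement in the commit above that Table XI's z/p columns reproduce from the displayed counts amounts to recomputing a two-proportion z-test from (k, n) pairs. A minimal sketch, assuming a pooled-variance, two-sided test (the commit does not state which variant Script 24 uses) and a hypothetical function name; the counts in the usage line are illustrative, not values from Table XI.

```python
import math
from statistics import NormalDist


def two_proportion_z(k1, n1, k2, n2):
    """Pooled two-proportion z statistic and two-sided p-value from raw counts."""
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sample sizes must be positive")
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both folds are all-success or all-failure: no detectable difference.
        return 0.0, 1.0
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Illustrative counts only: calibration fold vs held-out fold for one rule.
    z, p = two_proportion_z(60, 100, 40, 100)
    print(f"z={z:.3f} p={p:.4f}")
```

Computing z and p directly from the same k and n shown in the table, instead of from separately rounded rates, is what makes the columns reproducible by a reader.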
|
|
0ff1845b22 |
Paper A v3.4: resolve codex round-3 major-revision blockers
Three blockers from codex gpt-5.4 round-3 review (codex_review_gpt54_v3_3.md):
B1 Classifier vs three-method threshold mismatch
- Methodology III-L rewritten to make explicit that the per-signature
classifier and the accountant-level three-method convergence operate
at different units (signature vs accountant) and are complementary
rather than substitutable.
- Add Results IV-G.3 + Table XII operational-threshold sensitivity:
cos>0.95 vs cos>0.945 shifts dual-rule capture by 1.19 pp on whole
Firm A; ~5% of signatures flip at the Uncertain/Moderate boundary.
B2 Held-out validation false "within Wilson CI" claim
- Script 24 recomputes both calibration-fold and held-out-fold rates
with Wilson 95% CIs and a two-proportion z-test on each rule.
- Table XI replaced with the proper fold-vs-fold comparison; prose
in Results IV-G.2 and Discussion V-C corrected: extreme rules agree
across folds (p>0.7); operational rules in the 85-95% band differ
by 1-5 pp due to within-Firm-A heterogeneity (random 30% sample
contained more high-replication C1 accountants), not generalization
failure.
B3 Interview evidence reframed as practitioner knowledge
- The Firm A "interviews" referenced throughout v3.3 are private,
informal professional conversations, not structured research
interviews. Reframed accordingly: all "interview*" references in
abstract / intro / methodology / results / discussion / conclusion
are replaced with "domain knowledge / industry-practice knowledge".
- This avoids overclaiming methodological formality and removes the
human-subjects research framing that triggered the ethics-statement
requirement.
- Section III-H four-pillar Firm A validation now stands on visual
inspection, signature-level statistics, accountant-level GMM, and
the three Section IV-H analyses, with practitioner knowledge as
background context only.
- New Section III-M ("Data Source and Firm Anonymization") covers
MOPS public-data provenance, Firm A/B/C/D pseudonymization, and
conflict-of-interest declaration.
Add signature_analysis/24_validation_recalibration.py for the recomputed
calib-vs-held-out z-tests and the classifier sensitivity analysis;
output in reports/validation_recalibration/.
Pending (not in this commit): abstract length (368 -> 250 words),
Impact Statement removal, BD/McCrary sensitivity reporting, full
reproducibility appendix, references cleanup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
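The Wilson 95% intervals that Script 24 attaches to each fold's capture rate follow the standard Wilson score formula. The sketch below is a generic implementation under that assumption; `wilson_interval` is a hypothetical name, and the example counts are not taken from the paper.

```python
import math
from statistics import NormalDist


def wilson_interval(k, n, confidence=0.95):
    """Wilson score interval for a binomial proportion k/n."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    lo, hi = wilson_interval(50, 100)
    print(f"[{lo:.4f}, {hi:.4f}]")
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and remains usable for rates near 0% or 100%, which matters for the extreme rules compared across folds.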
|
|
9d19ca5a31 |
Paper A v3.1: apply codex peer-review fixes + add Scripts 20/21
Major fixes per codex (gpt-5.4) review:
## Structural fixes
- Fixed three-method convergence overclaim: added Script 20 to run KDE antimode, BD/McCrary, and Beta mixture EM on accountant-level means. Accountant-level 1D convergence: KDE antimode=0.973, Beta-2=0.979, LogGMM-2=0.976 (within ~0.006). BD/McCrary finds no transition at accountant level (consistent with smooth clustering, not sharp discontinuity).
- Disambiguated Method 1: KDE crossover (between two labeled distributions, used at signature all-pairs level) vs KDE antimode (single-distribution local minimum, used at accountant level).
- Addressed Firm A circular validation: Script 21 adds CPA-level 70/30 held-out fold. Calibration thresholds derived from 70% only; heldout rates reported with Wilson 95% CIs (e.g. cos>0.95 heldout=93.61% [93.21%-93.98%]).
- Fixed 139+32 vs 180: the split is 139/32 of 171 Firm A CPAs with >=10 signatures (9 CPAs excluded for insufficient sample). Reconciled across intro, results, discussion, conclusion.
- Added document-level classification aggregation rule (worst-case signature label determines document label).
## Pixel-identity validation strengthened
- Script 21: built ~50,000-pair inter-CPA random negative anchor (replaces the original n=35 same-CPA low-similarity negative which had untenable Wilson CIs).
- Added Wilson 95% CI for every FAR in Table X.
- Proper EER interpolation (FAR=FRR point) in Table X.
- Softened "conservative recall" claim to "non-generalizable subset" language per codex feedback (byte-identical positives are a subset, not a representative positive class).
- Added inter-CPA stats: mean=0.762, P95=0.884, P99=0.913.
## Terminology & sentence-level fixes
- "statistically independent methods" -> "methodologically distinct methods" throughout (three diagnostics on the same sample are not independent).
- "formal bimodality check" -> "unimodality test" (dip test tests H0 of unimodality; rejection is consistent with but not a direct test of bimodality).
- "Firm A near-universally non-hand-signed" -> already corrected to "replication-dominated" in prior commit; this commit strengthens that framing with explicit held-out validation.
- "discrete-behavior regimes" -> "clustered accountant-level heterogeneity" (BD/McCrary non-transition at accountant level rules out sharp discrete boundaries; the defensible claim is clustered-but-smooth).
- Softened White 1982 quasi-MLE claim (no longer framed as a guarantee).
- Fixed VLM 1.2% FP overclaim (now acknowledges the 1.2% could be VLM FP or YOLO FN).
- Unified "310 byte-identical signatures" language across Abstract, Results, Discussion (previously alternated between pairs/signatures).
- Defined min_dhash_independent explicitly in Section III-G.
- Fixed table numbering (Table XI heldout added, classification moved to XII, ablation to XIII).
- Explained 84,386 vs 85,042 gap (656 docs have only one signature, no pairwise stat).
- Made Table IX explicitly a "consistency check" not "validation"; paired it with Table XI held-out rates as the genuine external check.
- Defined 0.941 threshold (calibration-fold Firm A cosine P5).
- Computed 0.945 Firm A rate exactly (94.52%) instead of interpolated.
- Fixed Ref [24] Qwen2.5-VL to full IEEE format (arXiv:2502.13923).
## New artifacts
- Script 20: accountant-level three-method threshold analysis
- Script 21: expanded validation (inter-CPA anchor, held-out Firm A 70/30)
- paper/codex_review_gpt54_v3.md: preserved review feedback
Output: Paper_A_IEEE_Access_Draft_v3.docx (391 KB, rebuilt from v3.1 markdown sources).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
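The CPA-level 70/30 held-out fold mentioned in the commit above differs from a signature-level split in that every signature of a given CPA lands in exactly one fold. A minimal sketch of that idea, assuming a seeded shuffle of CPA identifiers; the function names, the `(sig_id, cpa_id)` tuple layout, and the seed are hypothetical and not taken from Script 21.

```python
import random


def split_cpas(cpa_ids, calib_frac=0.7, seed=0):
    """Partition distinct CPA ids into calibration and held-out sets."""
    unique = sorted(set(cpa_ids))
    random.Random(seed).shuffle(unique)
    n_calib = round(calib_frac * len(unique))
    return set(unique[:n_calib]), set(unique[n_calib:])


def assign_folds(signatures, calib_frac=0.7, seed=0):
    """signatures: iterable of (sig_id, cpa_id). Returns (calibration, held-out) sig_id lists."""
    signatures = list(signatures)
    calib_cpas, _ = split_cpas([cpa for _, cpa in signatures], calib_frac, seed)
    calib = [sig for sig, cpa in signatures if cpa in calib_cpas]
    held = [sig for sig, cpa in signatures if cpa not in calib_cpas]
    return calib, held


if __name__ == "__main__":
    # Toy data: 10 CPAs with 10 signatures each.
    toy = [(i, i % 10) for i in range(100)]
    calib, held = assign_folds(toy)
    print(len(calib), len(held))
```

Splitting by CPA keeps same-CPA signature pairs from straddling the two folds, so thresholds derived on the calibration fold are not evaluated on near-duplicates of their own inputs.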
|
|
9b11f03548 |
Paper A v3: full rewrite for IEEE Access with three-method convergence
Major changes from v2:
Terminology:
- "digitally replicated" -> "non-hand-signed" throughout (per partner v3 feedback and to avoid implicit accusation)
- "Firm A near-universal non-hand-signing" -> "replication-dominated" (per interview nuance: most but not all Firm A partners use replication)
Target journal: IEEE TAI -> IEEE Access (per NCKU CSIE list)
New methodological sections (III.G-III.L + IV.D-IV.G):
- Three convergent threshold methods (KDE antimode + Hartigan dip test / Burgstahler-Dichev McCrary / EM-fitted Beta mixture + logit-GMM robustness check)
- Explicit unit-of-analysis discussion (signature vs accountant)
- Accountant-level 2D Gaussian mixture (BIC-best K=3 found empirically)
- Pixel-identity validation anchor (no manual annotation needed)
- Low-similarity negative anchor + Firm A replication-dominated anchor
New empirical findings integrated:
- Firm A signature cosine UNIMODAL (dip p=0.17) - long left tail = minority hand-signers
- Full-sample cosine MULTIMODAL but not cleanly bimodal (BIC prefers 3-comp mixture) - signature-level is continuous quality spectrum
- Accountant-level mixture trimodal (C1 Deloitte-heavy 139/141, C2 other Big-4, C3 smaller firms). 2-comp crossings cos=0.945, dh=8.10
- Pixel-identity anchor (310 pairs) gives perfect recall at all cosine thresholds
- Firm A anchor rates: cos>0.95=92.5%, dual-rule cos>0.95 AND dh<=8=89.95%
New discussion section V.B: "Continuous-quality spectrum vs discrete-behavior regimes" - the core interpretive contribution of v3.
References added: Hartigan & Hartigan 1985, Burgstahler & Dichev 1997, McCrary 2008, Dempster-Laird-Rubin 1977, White 1982 (refs 37-41).
export_v3.py builds Paper_A_IEEE_Access_Draft_v3.docx (462 KB, +40% vs v2 from expanded methodology + results sections).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |