6946baa0965216e872b5eb4c80f87e86ffef35c9
8 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
12f716ddf1 |
Paper A v3.5: resolve codex round-4 residual issues
Fully addresses the partial-resolution / unfixed items from codex
gpt-5.4 round-4 review (codex_review_gpt54_v3_4.md):
Critical
- Table XI z/p columns now reproduce from displayed counts. Earlier
table had 1-4-unit transcription errors in k values and a fabricated
cos > 0.9407 calibration row; both fixed by rerunning Script 24
with cos = 0.9407 added to COS_RULES and copying exact values from
the JSON output.
- Section III-L classifier now defined entirely in terms of the
independent-minimum dHash statistic that the deployed code (Scripts
21, 23, 24) actually uses; the legacy "cosine-conditional dHash"
language is removed. Tables IX, XI, XII, XVI are now arithmetically
consistent with the III-L classifier definition.
- "0.95 not calibrated to Firm A" inconsistency reconciled: Section
III-H now correctly says 0.95 is the whole-sample Firm A P95 of the
per-signature cosine distribution, matching III-L and IV-F.
Major
- Abstract trimmed to 246 words (from 367) to meet IEEE Access 250-word
limit. Removed "we break the circularity" overclaim; replaced with
"report capture rates on both folds with Wilson 95% intervals to
make fold-level variance visible".
- Conclusion mirrors the Abstract reframe: 70/30 split documents
within-firm sampling variance, not external generalization.
- Introduction no longer promises precision / F1 / EER metrics that
Methods/Results don't deliver; replaced with anchor-based capture /
FAR + Wilson CI language.
- Section III-G within-auditor-year empirical-check wording corrected:
intra-report consistency (IV-H.3) is a different test (two co-signers
on the same report, firm-level homogeneity) and is not a within-CPA
year-level mixing check; the assumption is maintained as a bounded
identification convention.
- Section III-H "two analyses fully threshold-free" corrected to "only
the partner-level ranking is threshold-free"; longitudinal-stability
uses 0.95 cutoff, intra-report uses the operational classifier.
Minor
- Impact Statement removed from export_v3.py SECTIONS list (IEEE Access
Regular Papers do not have a standalone Impact Statement). The file
itself is retained as an archived non-paper note for cover-letter /
grant-report reuse, with a clear archive header.
- All 7 previously unused references ([27] dHash, [31][32] partner-
signature mandates, [33] Taiwan partner rotation, [34] YOLO original,
[35] VLM survey, [36] Mann-Whitney) are now cited in-text:
[27] in Methodology III-E (dHash definition)
[31][32][33] in Introduction (audit-quality regulation context)
[34][35] in Methodology III-C/III-D
[36] in Results IV-C (Mann-Whitney result)
Updated Script 24 to include cos = 0.9407 in COS_RULES so Table XI's
calibration-fold P5 row is computed from the same data file as the
other rows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
0ff1845b22 |
Paper A v3.4: resolve codex round-3 major-revision blockers
Three blockers from codex gpt-5.4 round-3 review (codex_review_gpt54_v3_3.md):
B1 Classifier vs three-method threshold mismatch
- Methodology III-L rewritten to make explicit that the per-signature
classifier and the accountant-level three-method convergence operate
at different units (signature vs accountant) and are complementary
rather than substitutable.
- Add Results IV-G.3 + Table XII operational-threshold sensitivity:
cos>0.95 vs cos>0.945 shifts dual-rule capture by 1.19 pp on whole
Firm A; ~5% of signatures flip at the Uncertain/Moderate boundary.
B2 Held-out validation false "within Wilson CI" claim
- Script 24 recomputes both calibration-fold and held-out-fold rates
with Wilson 95% CIs and a two-proportion z-test on each rule.
- Table XI replaced with the proper fold-vs-fold comparison; prose
in Results IV-G.2 and Discussion V-C corrected: extreme rules agree
across folds (p>0.7); operational rules in the 85-95% band differ
by 1-5 pp due to within-Firm-A heterogeneity (random 30% sample
contained more high-replication C1 accountants), not generalization
failure.
B3 Interview evidence reframed as practitioner knowledge
- The Firm A "interviews" referenced throughout v3.3 are private,
informal professional conversations, not structured research
interviews. Reframed accordingly: all "interview*" references in
abstract / intro / methodology / results / discussion / conclusion
are replaced with "domain knowledge / industry-practice knowledge".
- This avoids overclaiming methodological formality and removes the
human-subjects research framing that triggered the ethics-statement
requirement.
- Section III-H four-pillar Firm A validation now stands on visual
inspection, signature-level statistics, accountant-level GMM, and
the three Section IV-H analyses, with practitioner knowledge as
background context only.
- New Section III-M ("Data Source and Firm Anonymization") covers
MOPS public-data provenance, Firm A/B/C/D pseudonymization, and
conflict-of-interest declaration.
Add signature_analysis/24_validation_recalibration.py for the recomputed
calib-vs-held-out z-tests and the classifier sensitivity analysis;
output in reports/validation_recalibration/.
Pending (not in this commit): abstract length (368 -> 250 words),
Impact Statement removal, BD/McCrary sensitivity reporting, full
reproducibility appendix, references cleanup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
51d15b32a5 |
Paper A v3.2: partner v4 feedback integration (threshold-independent benchmark validation)
Partner v4 (signature_paper_draft_v4) proposed 3 substantive improvements; partner confirmed the 2013-2019 restriction was an error (sample stays 2013-2023). The remaining suggestions are adopted with our own data. ## New scripts - Script 22 (partner ranking): ranks all Big-4 auditor-years by mean max-cosine. Firm A occupies 95.9% of top-10% (base 27.8%), 3.5x concentration ratio. Stable across 2013-2023 (88-100% per year). - Script 23 (intra-report consistency): for each 2-signer report, classify both signatures and check agreement. Firm A agrees 89.9% vs 62-67% at other Big-4. 87.5% Firm A reports have BOTH signers non-hand-signed; only 4 reports (0.01%) both hand-signed. ## New methodology additions - III-G: explicit within-auditor-year no-mixing identification assumption (supported by Firm A interview evidence). - III-H: 4th Firm A validation line: threshold-independent evidence from partner ranking + intra-report consistency. ## New results section IV-H (threshold-independent validation) - IV-H.1: Firm A year-by-year cosine<0.95 rate. 2013-2019 mean=8.26%, 2020-2023 mean=6.96%, 2023 lowest (3.75%). Stability contradicts partner's hypothesis that 2020+ electronic systems increase heterogeneity -- data shows opposite (electronic systems more consistent than physical stamping). - IV-H.2: partner ranking top-K tables (pooled + year-by-year). - IV-H.3: intra-report consistency per-firm table. ## Renumbering - Section H (was Classification Results) -> I - Section I (was Ablation) -> J - Tables XIII-XVI new (yearly stability, top-K pooled, top-10% per-year, intra-report), XVII = classification (was XII), XVIII = ablation (was XIII). These threshold-independent analyses address the codex review concern about circular validation by providing benchmark evidence that does not depend on any threshold calibrated to Firm A itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
9d19ca5a31 |
Paper A v3.1: apply codex peer-review fixes + add Scripts 20/21
Major fixes per codex (gpt-5.4) review: ## Structural fixes - Fixed three-method convergence overclaim: added Script 20 to run KDE antimode, BD/McCrary, and Beta mixture EM on accountant-level means. Accountant-level 1D convergence: KDE antimode=0.973, Beta-2=0.979, LogGMM-2=0.976 (within ~0.006). BD/McCrary finds no transition at accountant level (consistent with smooth clustering, not sharp discontinuity). - Disambiguated Method 1: KDE crossover (between two labeled distributions, used at signature all-pairs level) vs KDE antimode (single-distribution local minimum, used at accountant level). - Addressed Firm A circular validation: Script 21 adds CPA-level 70/30 held-out fold. Calibration thresholds derived from 70% only; heldout rates reported with Wilson 95% CIs (e.g. cos>0.95 heldout=93.61% [93.21%-93.98%]). - Fixed 139+32 vs 180: the split is 139/32 of 171 Firm A CPAs with >=10 signatures (9 CPAs excluded for insufficient sample). Reconciled across intro, results, discussion, conclusion. - Added document-level classification aggregation rule (worst-case signature label determines document label). ## Pixel-identity validation strengthened - Script 21: built ~50,000-pair inter-CPA random negative anchor (replaces the original n=35 same-CPA low-similarity negative which had untenable Wilson CIs). - Added Wilson 95% CI for every FAR in Table X. - Proper EER interpolation (FAR=FRR point) in Table X. - Softened "conservative recall" claim to "non-generalizable subset" language per codex feedback (byte-identical positives are a subset, not a representative positive class). - Added inter-CPA stats: mean=0.762, P95=0.884, P99=0.913. ## Terminology & sentence-level fixes - "statistically independent methods" -> "methodologically distinct methods" throughout (three diagnostics on the same sample are not independent). - "formal bimodality check" -> "unimodality test" (dip test tests H0 of unimodality; rejection is consistent with but not a direct test of bimodality). - "Firm A near-universally non-hand-signed" -> already corrected to "replication-dominated" in prior commit; this commit strengthens that framing with explicit held-out validation. - "discrete-behavior regimes" -> "clustered accountant-level heterogeneity" (BD/McCrary non-transition at accountant level rules out sharp discrete boundaries; the defensible claim is clustered-but-smooth). - Softened White 1982 quasi-MLE claim (no longer framed as a guarantee). - Fixed VLM 1.2% FP overclaim (now acknowledges the 1.2% could be VLM FP or YOLO FN). - Unified "310 byte-identical signatures" language across Abstract, Results, Discussion (previously alternated between pairs/signatures). - Defined min_dhash_independent explicitly in Section III-G. - Fixed table numbering (Table XI heldout added, classification moved to XII, ablation to XIII). - Explained 84,386 vs 85,042 gap (656 docs have only one signature, no pairwise stat). - Made Table IX explicitly a "consistency check" not "validation"; paired it with Table XI held-out rates as the genuine external check. - Defined 0.941 threshold (calibration-fold Firm A cosine P5). - Computed 0.945 Firm A rate exactly (94.52%) instead of interpolated. - Fixed Ref [24] Qwen2.5-VL to full IEEE format (arXiv:2502.13923). ## New artifacts - Script 20: accountant-level three-method threshold analysis - Script 21: expanded validation (inter-CPA anchor, held-out Firm A 70/30) - paper/codex_review_gpt54_v3.md: preserved review feedback Output: Paper_A_IEEE_Access_Draft_v3.docx (391 KB, rebuilt from v3.1 markdown sources). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
68689c9f9b |
Correct Firm A framing: replication-dominated, not pure
Interview evidence from multiple Firm A accountants confirms that MOST use replication (stamping / firm-level e-signing) but a MINORITY may still hand-sign. Firm A is therefore a "replication-dominated" population, not a "pure" one. This framing is consistent with: - 92.5% of Firm A signatures exceed cosine 0.95 (majority replication) - The long left tail (~7%) captures the minority hand-signers, not scan noise or preprocessing artifacts - Hartigan dip test: Firm A cosine unimodal long-tail (p=0.17) - Accountant-level GMM: of 180 Firm A accountants, 139 cluster in C1 (high-replication) and 32 in C2 (middle band = minority hand-signers) Updates docstrings and report text in Scripts 15, 16, 18, 19 to match. Partner v3's "near-universal non-hand-signing" language corrected. Script 19 regenerated with the updated text. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
fbfab1fa68 |
Add three-convergent-method threshold scripts + pixel-identity validation
Implements Partner v3's statistical rigor requirements at the level of signature vs. accountant analysis units: - Script 15 (Hartigan dip test): formal unimodality test via `diptest`. Result: Firm A cosine UNIMODAL (p=0.17, pure non-hand-signed population); full-sample cosine MULTIMODAL (p<0.001, mix of two regimes); accountant-level aggregates MULTIMODAL on both cos and dHash. - Script 16 (Burgstahler-Dichev / McCrary): discretised Z-score transition detection. Firm A and full-sample cosine transitions at 0.985; dHash at 2.0. - Script 17 (Beta mixture EM + logit-GMM): 2/3-component Beta via EM with MoM M-step, plus parallel Gaussian mixture on logit transform as White (1982) robustness check. Beta-3 BIC < Beta-2 BIC at signature level confirms 2-component is a forced fit -- supporting the pivot to accountant-level mixture. - Script 18 (Accountant-level GMM): rebuilds the 2026-04-16 analysis that was done inline and not saved. BIC-best K=3 with components matching prior memory almost exactly: C1 (cos=0.983, dh=2.41, 20%, Deloitte 139/141), C2 (0.954, 6.99, 51%, KPMG/PwC/EY), C3 (0.928, 11.17, 28%, small firms). 2-component natural thresholds: cos=0.9450, dh=8.10. - Script 19 (Pixel-identity validation): no human annotation needed. Uses pixel_identical_to_closest (310 sigs) as gold positive and Firm A as anchor positive. Confirms Firm A cosine>0.95 = 92.51% (matches prior 2026-04-08 finding of 92.5%), dual rule cos>0.95 AND dhash_indep<=8 captures 89.95% of Firm A. Python deps added: diptest, scikit-learn (installed into venv). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
a261a22bd2 |
Add Deloitte distribution & independent dHash analysis scripts
- Script 13: Firm A normality/multimodality analysis (Shapiro-Wilk, Anderson-Darling, KDE, per-accountant ANOVA, Beta/Gamma fitting) - Script 14: Independent min-dHash computation across all pairs per accountant (not just cosine-nearest pair) - THRESHOLD_VALIDATION_OPTIONS: 2026-01 discussion doc on threshold validation approaches - .gitignore: exclude model weights, node artifacts, and xlsx data Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
939a348da4 |
Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification
Paper draft includes all sections (Abstract through Conclusion), 36 references, and supporting scripts. Key methodology: Cosine similarity + dHash dual-method verification with thresholds calibrated against known-replication firm (Firm A). Includes: - 8 section markdown files (paper_a_*.md) - Ablation study script (ResNet-50 vs VGG-16 vs EfficientNet-B0) - Recalibrated classification script (84,386 PDFs, 5-tier system) - Figure generation and Word export scripts - Citation renumbering script ([1]-[36]) - Signature analysis pipeline (12 steps) - YOLO extraction scripts Three rounds of AI review completed (GPT-5.4, Claude Opus 4.6, Gemini 3 Pro). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |