Paper A v3.4: resolve codex round-3 major-revision blockers
Three blockers from codex gpt-5.4 round-3 review (codex_review_gpt54_v3_3.md):
B1 Classifier vs three-method threshold mismatch
- Methodology III-L rewritten to make explicit that the per-signature
classifier and the accountant-level three-method convergence operate
at different units (signature vs accountant) and are complementary
rather than substitutable.
- Add Results IV-G.3 and Table XII (operational-threshold sensitivity):
switching between cos>0.95 and cos>0.945 shifts dual-rule capture by
1.19 pp on the full Firm A sample; ~5% of signatures flip at the
Uncertain/Moderate boundary.
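The threshold-sensitivity comparison above can be sketched as follows. This is a minimal illustration, not the code in script 24: the function names and the sample scores are hypothetical, and only the cosine cutoff is modeled (the second half of the dual rule is not described in this message, so it is left out).

```python
def capture_rate(scores, threshold):
    """Share of signatures whose cosine similarity exceeds the threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(s > threshold for s in scores) / len(scores)


def threshold_shift_pp(scores, t_base=0.95, t_alt=0.945):
    """Change in capture rate, in percentage points, when the cutoff
    moves from t_base to t_alt."""
    return 100.0 * (capture_rate(scores, t_alt) - capture_rate(scores, t_base))


# Hypothetical per-signature cosine scores, for illustration only.
example_scores = [0.940, 0.946, 0.948, 0.960, 0.990]
shift = threshold_shift_pp(example_scores)
```

Signatures with scores between the two cutoffs are the ones that change class, so the shift in percentage points equals the share of signatures in that interval.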
B2 Held-out validation false "within Wilson CI" claim
- Script 24 recomputes both calibration-fold and held-out-fold rates
with Wilson 95% CIs and a two-proportion z-test on each rule.
- Table XI replaced with the proper fold-vs-fold comparison; prose
in Results IV-G.2 and Discussion V-C corrected: extreme rules agree
across folds (p>0.7); operational rules in the 85-95% band differ
by 1-5 pp due to within-Firm-A heterogeneity (random 30% sample
contained more high-replication C1 accountants), not generalization
failure.
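For reference, the two statistics named in B2 can be sketched with the standard library alone. This is a hedged sketch under the textbook definitions (Wilson score interval, pooled two-proportion z-test with a two-sided p-value); the function names are hypothetical and script 24 may compute them differently.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% normal quantile


def wilson_ci(successes, n, z=Z_95):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        # Both folds are all-success or all-failure: no detectable difference.
        return 0.0, 1.0
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value
```

Applied per rule, `wilson_ci` would be called once for the calibration fold and once for the held-out fold, and `two_proportion_z` compares the two capture counts directly, which is the fold-vs-fold comparison rather than a check of one fold's rate against the other's interval.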
B3 Interview evidence reframed as practitioner knowledge
- The Firm A "interviews" referenced throughout v3.3 are private,
informal professional conversations, not structured research
interviews. Reframed accordingly: all "interview*" references in
abstract / intro / methodology / results / discussion / conclusion
are replaced with "domain knowledge / industry-practice knowledge".
- This avoids overclaiming methodological formality and removes the
human-subjects research framing that triggered the ethics-statement
requirement.
- Section III-H four-pillar Firm A validation now stands on visual
inspection, signature-level statistics, accountant-level GMM, and
the three Section IV-H analyses, with practitioner knowledge as
background context only.
- New Section III-M ("Data Source and Firm Anonymization") covers
MOPS public-data provenance, Firm A/B/C/D pseudonymization, and
a conflict-of-interest declaration.
Add signature_analysis/24_validation_recalibration.py for the recomputed
calib-vs-held-out z-tests and the classifier sensitivity analysis;
output in reports/validation_recalibration/.
Pending (not in this commit): abstract length (368 -> 250 words),
Impact Statement removal, BD/McCrary sensitivity reporting, full
reproducibility appendix, references cleanup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -18,7 +18,7 @@ The substantive reading is therefore narrower than "discrete behavior": *pixel-l
Fourth, we introduced a *replication-dominated* calibration methodology---explicitly distinguishing replication-dominated from replication-pure calibration anchors and validating classification against a byte-level pixel-identity anchor (310 byte-identical signatures) paired with a $\sim$50,000-pair inter-CPA negative anchor.
To break the circularity of using the calibration firm as its own validation reference, we split the firm's CPAs 70/30 at the CPA level and report post-hoc capture rates on the held-out fold with Wilson 95% confidence intervals.
-This framing is internally consistent with all available evidence: interview reports that the calibration firm uses non-hand-signing for most but not all partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and, among the 171 calibration-firm CPAs with enough signatures to enter the accountant-level GMM (of 180 in total), the 139 / 32 split between the high-replication and middle-band clusters.
+This framing is internally consistent with all available evidence: the visual-inspection observation of pixel-identical signatures across unrelated audit engagements for the majority of calibration-firm partners; the 92.5% / 7.5% split in signature-level cosine thresholds; and, among the 171 calibration-firm CPAs with enough signatures to enter the accountant-level GMM (of 180 in total), the 139 / 32 split between the high-replication and middle-band clusters.
An ablation study comparing ResNet-50, VGG-16 and EfficientNet-B0 confirmed that ResNet-50 offers the best balance of discriminative power, classification stability, and computational efficiency for this task.