11 Commits

Author SHA1 Message Date
gbanyan fcce58aff0 Paper A v3.8: resolve Gemini 3.1 Pro round-6 independent-review findings
Gemini round-6 (paper/gemini_review_v3_7.md) gave Minor Revision but
flagged three issues that five rounds of codex review had missed.
This commit addresses all three.

BLOCKER: Accountant-level BD/McCrary null is a power artifact, not
proof of smoothness (Gemini Issue 1)
- At N=686 accountants the BD/McCrary test has limited statistical
  power; reading a failure-to-reject as affirmative proof of
  smoothness ignores the risk of a Type II error.
- Discussion V-B: "itself diagnostic of smoothness" replaced with
  "failure-to-reject rather than a failure of the method ---
  informative alongside the other evidence but subject to the power
  caveat in Section V-G".
- Discussion V-G (Sixth limitation): added a power-aware paragraph
  naming N=686 explicitly and clarifying that the substantive claim
  of smoothly-mixed clustering rests on the JOINT weight of dip
  test + BIC-selected GMM + BD null, not on BD alone.
- Results IV-D.1 and IV-E: reframe accountant-level null as
  "consistent with --- not affirmative proof of" clustered-but-
  smoothly-mixed, citing V-G for the power caveat.
- Appendix A interpretation paragraph: explicit inferential-asymmetry
  sentence ("consistency is what the BD null delivers, not
  affirmative proof"); "itself evidence for" removed.
- Conclusion: "consistent with clustered but smoothly mixed"
  rephrased with explicit power caveat ("at N = 686 the test has
  limited power and cannot affirmatively establish smoothness").

MAJOR: Table X FRR / EER was tautological reviewer-bait
(Gemini Issue 2)
- Byte-identical positive anchor has cosine approx 1 by construction,
  so FRR against that subset is trivially 0 at every threshold
  below 1 and any EER calculation is arithmetic tautology, not
  biometric performance.
- Results IV-G.1: removed EER row; dropped FRR column from Table X;
  added a table note explaining the omission and directing readers
  to Section V-F for the conservative-subset discussion.
- Methodology III-K: removed the EER / FRR-against-byte-identical
  reporting clause; clarified that FAR against inter-CPA negatives
  is the primary reported quantity.
- Table X is now FAR + Wilson 95% CI only, which is the quantity
  that actually carries empirical content on this anchor design.
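
The reasoning in this block can be illustrated with a minimal sketch. The scores are toy values, not data from the paper, and `wilson_interval`, `far`, and `frr` are hypothetical helper names, not functions taken from the repository's scripts. It shows a FAR with a Wilson 95% interval, and why FRR against a cosine-1 positive anchor is 0 at every threshold below 1.

```python
import math

def wilson_interval(k, n, z=1.959964):
    """Wilson score interval for a binomial proportion k/n (95% by default)."""
    if n == 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def far(negative_scores, threshold):
    """False accepts: (count, total) of negative pairs scoring above the threshold."""
    k = sum(s > threshold for s in negative_scores)
    return k, len(negative_scores)

def frr(positive_scores, threshold):
    """False reject rate: share of positives scoring at or below the threshold."""
    return sum(s <= threshold for s in positive_scores) / len(positive_scores)

# Toy data (illustrative only, not values from the paper).
negatives = [0.70, 0.76, 0.81, 0.88, 0.96]
byte_identical = [1.0, 1.0, 1.0]

k, n = far(negatives, 0.95)
low, high = wilson_interval(k, n)
print(f"FAR = {k}/{n}, Wilson 95% CI = [{low:.3f}, {high:.3f}]")

# Every byte-identical positive has cosine 1, so FRR is 0 below 1.
print([frr(byte_identical, t) for t in (0.90, 0.95, 0.99)])
```

Because the positive anchor never produces a rejection below 1, only the FAR side of the table varies with the threshold, which is why the interval is attached to FAR alone.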

MINOR: Document-level worst-case aggregation narrative (Gemini
Issue 3) + 15-signature delta (Gemini spot-check)
- Results IV-I: added two sentences explicitly noting that the
  document-level percentages reflect the Section III-L worst-case
  aggregation rule (a report with one stamped + one hand-signed
  signature inherits the most-replication-consistent label), and
  cross-referencing Section IV-H.3 / Table XVI for the mixed-report
  composition that qualifies the headline percentages.
- Results IV-D: added a one-sentence footnote explaining that the
  15-signature delta between the Table III CPA-matched count
  (168,755) and the all-pairs analyzed count (168,740) is due to
  CPAs with exactly one signature, for whom no same-CPA pairwise
  best-match statistic exists.
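
A minimal sketch of the worst-case aggregation rule described for Results IV-I, under stated assumptions: the label names and their ordering below are hypothetical placeholders (this log does not list the paper's actual categories), and `document_label` is an illustrative name, not the deployed function.

```python
# Hypothetical label ordering, from least to most replication-consistent.
# The paper's actual category names and tiers are not given in this log.
LABEL_ORDER = [
    "likely_hand_signed",
    "uncertain",
    "moderate_replication",
    "high_replication",
]
RANK = {label: i for i, label in enumerate(LABEL_ORDER)}

def document_label(signature_labels):
    """Return the most replication-consistent label among a document's signatures."""
    if not signature_labels:
        raise ValueError("a document needs at least one classified signature")
    return max(signature_labels, key=RANK.__getitem__)

# A report with one stamped and one hand-signed signature inherits the stamped label.
print(document_label(["high_replication", "likely_hand_signed"]))
```

Under this rule a mixed report is counted with the replicated group, which is why the headline document-level percentages need the mixed-report composition alongside them.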

Abstract remains 243 words, comfortably under the IEEE Access
250-word cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 14:47:48 +08:00
gbanyan 552b6b80d4 Paper A v3.7: demote BD/McCrary to density-smoothness diagnostic; add Appendix A
Implements codex gpt-5.4 recommendation (paper/codex_bd_mccrary_opinion.md,
"option (c) hybrid"): demote BD/McCrary in the main text from a co-equal
threshold estimator to a density-smoothness diagnostic, and add a
bin-width sensitivity appendix as an audit trail.

Why: the bin-width sweep (Script 25) confirms that at the signature
level the BD transition drifts with bin width (Firm A cosine falls
monotonically, 0.987 -> 0.985 -> 0.980 -> 0.975, as bin width widens
0.003 -> 0.015; full-sample dHash transitions move from 2 to 10 to 9
across bin widths 1 / 2 / 3) and Z statistics inflate superlinearly with
bin width, both characteristic of a histogram-resolution artifact. At the
accountant level the BD null is robust across the sweep. The paper's
earlier "three methodologically distinct estimators" framing therefore
could not be defended to an IEEE Access reviewer once the sweep was
run.
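
For readers unfamiliar with the diagnostic, a bin-width sweep of this kind can be sketched as follows. This is a stand-alone illustration using the commonly cited Burgstahler-Dichev standardized-difference form; whether Script 25 uses this exact variance term is an assumption, and `histogram`, `bd_z_scores`, and `sweep` are hypothetical names.

```python
import math
from collections import Counter

def histogram(values, bin_width, origin=0.0):
    """Count values per bin; returns counts for every bin from min to max index."""
    idx = Counter(math.floor((v - origin) / bin_width) for v in values)
    lo, hi = min(idx), max(idx)
    return [idx.get(i, 0) for i in range(lo, hi + 1)]

def bd_z_scores(counts):
    """Standardized difference of each interior bin from the mean of its neighbours."""
    total = sum(counts)
    scores = []
    for i in range(1, len(counts) - 1):
        p_prev, p_i, p_next = (counts[j] / total for j in (i - 1, i, i + 1))
        expected = (counts[i - 1] + counts[i + 1]) / 2
        var = (total * p_i * (1 - p_i)
               + 0.25 * total * (p_prev + p_next) * (1 - p_prev - p_next))
        z = (counts[i] - expected) / math.sqrt(var) if var > 0 else float("nan")
        scores.append(z)
    return scores

def sweep(values, bin_widths):
    """Largest |Z| per bin width, to see how the statistic moves with resolution."""
    out = {}
    for w in bin_widths:
        zs = [abs(z) for z in bd_z_scores(histogram(values, w)) if not math.isnan(z)]
        out[w] = max(zs) if zs else float("nan")
    return out

# Toy integer distances (illustrative only, not data from the paper).
toy = [0] * 5 + [1] * 9 + [2] * 20 + [3] * 9 + [4] * 5
print(sweep(toy, [1, 2, 3]))
```

If the location or size of the largest |Z| moves whenever the bin width changes, the apparent transition reflects histogram resolution more than the underlying density.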

Added
- signature_analysis/25_bd_mccrary_sensitivity.py: bin-width sweep
  across 6 variants (Firm A / full-sample / accountant-level, each
  cosine + dHash_indep) and 3-4 bin widths per variant. Reports
  Z_below, Z_above, p-values, and number of significant transitions
  per cell. Writes reports/bd_sensitivity/bd_sensitivity.{json,md}.
- paper/paper_a_appendix_v3.md: new "Appendix A. BD/McCrary Bin-Width
  Sensitivity" with Table A.I (all 20 sensitivity cells) and
  interpretation linking the empirical pattern to the main-text
  framing decision.
- export_v3.py: appendix inserted into SECTIONS between conclusion
  and references.
- paper/codex_bd_mccrary_opinion.md: codex gpt-5.4 recommendation
  captured verbatim for audit trail.

Main-text reframing
- Abstract: "three methodologically distinct estimators" ->
  "two estimators plus a Burgstahler-Dichev/McCrary density-
  smoothness diagnostic". Trimmed to 243 words.
- Introduction: related-work summary, pipeline step 5, accountant-
  level convergence sentence, contribution 4, and section-outline
  line all updated. Contribution 4 renamed to "Convergent threshold
  framework with a smoothness diagnostic".
- Methodology III-I: section renamed to "Convergent Threshold
  Determination with a Density-Smoothness Diagnostic". "Method 2:
  BD/McCrary Discontinuity" converted to "Density-Smoothness
  Diagnostic" in a new subsection; Method 3 (Beta mixture) renumbered
  to Method 2. Subsections 4 and 5 updated to refer to "two threshold
  estimators" with BD as diagnostic.
- Methodology III-A pipeline overview: "three methodologically
  distinct statistical methods" -> "two methodologically distinct
  threshold estimators complemented by a density-smoothness
  diagnostic".
- Methodology III-L: "three-method analysis" -> "accountant-level
  threshold analysis (KDE antimode, Beta-2 crossing, logit-Gaussian
  robustness crossing)".
- Results IV-D.1 heading: "BD/McCrary Discontinuity" ->
  "BD/McCrary Density-Smoothness Diagnostic". Prose now notes the
  Appendix-A bin-width instability explicitly.
- Results IV-E: Table VIII restructured to label BD rows
  "(diagnostic only; bin-unstable)" and "(diagnostic; null across
  Appendix A)". Summary sentence rewritten to frame BD null as
  evidence for clustered-but-smoothly-mixed rather than as a
  convergence failure. Table cosine P5 row corrected from 0.941 to
  0.9407 to match III-K.
- Results IV-G.3 and IV-I.2: "three-method convergence/thresholds"
  -> "accountant-level convergent thresholds" (clarifies the 3
  converging estimates are KDE antimode, Beta-2, logit-Gaussian,
  not KDE/BD/Beta).
- Discussion V-B: "three-method framework" -> "convergent threshold
  framework".
- Conclusion: "three methodologically distinct methods" -> "two
  threshold estimators and a density-smoothness diagnostic";
  contribution 3 restated; future-work sentence updated.
- Impact Statement (archived): "three methodologically distinct
  threshold-selection methods" -> "two methodologically distinct
  threshold estimators plus a density-smoothness diagnostic" so the
  archived text is internally consistent if reused.

Discussion V-B / V-G already framed BD as a diagnostic in v3.5
(unchanged in this commit). The reframing therefore brings Abstract /
Introduction / Methodology / Results / Conclusion into alignment with
the Discussion framing that codex had already endorsed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 14:32:50 +08:00
gbanyan 6946baa096 Paper A v3.6: codex round-5 quick-wins cleanup (Minor Revision)
Codex gpt-5.4 round-5 (codex_review_gpt54_v3_5.md) verdict was Minor
Revision - all v3.4 round-4 PARTIAL/UNFIXED items now confirmed
RESOLVED, including line-by-line recomputation of Table XI z/p
matching the manuscript values. This commit cleans the remaining
quick-win items:

Table IX numerical sync to Script 24 authoritative values
- Seven count corrections: cos>0.837 (60,405->60,408), cos>0.945
  (57,131/94.52% -> 56,836/94.02%, was 295 sigs / 0.50 pp off),
  cos>0.973 (48,910/80.91% -> 48,028/79.45%, was 882 sigs / 1.46 pp
  off), cos>0.95 (55,916->55,922), dh<=8 (57,521->57,527),
  dh<=15 (60,345->60,348), dual (54,373->54,370).
- Threshold label cos>0.941 -> cos>0.9407 (use exact calib-fold P5
  rather than rounded value).
- "dHash_indep <= 5 (calib-fold median-adjacent)" relabeled to
  "(whole-sample upper-tail of mode)" to match what III-L explains.
- Added "(operational dual)" / "(style-consistency boundary)" labels
  for unambiguous mapping into III-L category definitions.
- Removed circularity-language footnote inside the table comment.

Circularity overclaim removed paper-wide
- Methodology III-K (Section 3 anchor): "we break the resulting
  circularity" -> "we make the within-Firm-A sampling variance
  visible".
- Results IV-G.2 subsection title: "(breaks calibration-validation
  circularity)" -> "(within-Firm-A sampling variance disclosure)".
- Combined with the v3.5 Abstract / Conclusion edits, no surviving
  use of circular* anywhere in the paper.

export_v3.py title page now single-anonymized
- Removed "[Authors removed for double-blind review]" placeholder
  (IEEE Access uses single-anonymized review).
- Replaced with explicit "[AUTHOR NAMES - fill in before submission]"
  + affiliation placeholder so the requirement is unmissable.
- Subtitle now reads "single-anonymized review".

III-G stale "cosine-conditional dHash" sentence removed
- After the v3.5 III-L rewrite to dh_indep, the sentence at
  Methodology L131 referencing "cosine-conditional dHash used as a
  diagnostic elsewhere" no longer described any current paper usage.
- Replaced with a positive statement that dh_indep is the dHash
  statistic used throughout the operational classifier and all
  reported capture-rate analyses.

Abstract trimmed 247 -> 242 words for IEEE 250-word safety margin
- "an end-to-end pipeline" -> "a pipeline"; "Unlike signature
  forgery" -> "Unlike forgery"; "we report" passive recast; small
  conjunction trims.

Outstanding items deferred (require user decision / larger scope):
- BD/McCrary either substantiate (Z/p table + bin-width robustness)
  or demote to supplementary diagnostic.
- Visual-inspection protocol disclosure (sample size, rater count,
  blinding, adjudication rule).
- Reproducibility appendix (VLM prompt, HSV thresholds, seeds, EM
  init / stopping / boundary handling).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 12:41:11 +08:00
gbanyan 12f716ddf1 Paper A v3.5: resolve codex round-4 residual issues
Fully addresses the partial-resolution / unfixed items from codex
gpt-5.4 round-4 review (codex_review_gpt54_v3_4.md):

Critical
- Table XI z/p columns now reproduce from displayed counts. Earlier
  table had 1-4-unit transcription errors in k values and a fabricated
  cos > 0.9407 calibration row; both fixed by rerunning Script 24
  with cos = 0.9407 added to COS_RULES and copying exact values from
  the JSON output.
- Section III-L classifier now defined entirely in terms of the
  independent-minimum dHash statistic that the deployed code (Scripts
  21, 23, 24) actually uses; the legacy "cosine-conditional dHash"
  language is removed. Tables IX, XI, XII, XVI are now arithmetically
  consistent with the III-L classifier definition.
- "0.95 not calibrated to Firm A" inconsistency reconciled: Section
  III-H now correctly says 0.95 is the whole-sample Firm A P95 of the
  per-signature cosine distribution, matching III-L and IV-F.
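
The two dHash conventions named above can be contrasted with a small sketch. The definitions here are an interpretation inferred from the names in this log (cosine-conditional = dHash distance to the cosine best match; independent minimum = smallest dHash distance to any other same-CPA signature), not code from Scripts 21, 23, or 24; all function and field names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def hamming(h1, h2):
    """Hamming distance between two integer-encoded hashes."""
    return bin(h1 ^ h2).count("1")

def dhash_statistics(target, others):
    """Return (cosine-conditional dHash, independent-minimum dHash).

    target and others are dicts with 'feat' (vector) and 'dhash' (int);
    others holds the remaining signatures of the same CPA.
    """
    best = max(others, key=lambda o: cosine(target["feat"], o["feat"]))
    conditional = hamming(target["dhash"], best["dhash"])
    independent = min(hamming(target["dhash"], o["dhash"]) for o in others)
    return conditional, independent

# Toy signatures (illustrative only).
target = {"feat": [1.0, 0.0], "dhash": 0b1111}
others = [
    {"feat": [1.0, 0.1], "dhash": 0b0000},  # cosine best match, 4 bits apart
    {"feat": [0.0, 1.0], "dhash": 0b1110},  # dissimilar features, 1 bit apart
]
print(dhash_statistics(target, others))
```

Under this reading the independent minimum can never exceed the cosine-conditional value, so cutoffs calibrated on one statistic do not transfer to the other.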

Major
- Abstract trimmed to 246 words (from 367) to meet IEEE Access 250-word
  limit. Removed "we break the circularity" overclaim; replaced with
  "report capture rates on both folds with Wilson 95% intervals to
  make fold-level variance visible".
- Conclusion mirrors the Abstract reframe: 70/30 split documents
  within-firm sampling variance, not external generalization.
- Introduction no longer promises precision / F1 / EER metrics that
  Methods/Results don't deliver; replaced with anchor-based capture /
  FAR + Wilson CI language.
- Section III-G within-auditor-year empirical-check wording corrected:
  intra-report consistency (IV-H.3) is a different test (two co-signers
  on the same report, firm-level homogeneity) and is not a within-CPA
  year-level mixing check; the assumption is maintained as a bounded
  identification convention.
- Section III-H "two analyses fully threshold-free" corrected to "only
  the partner-level ranking is threshold-free"; longitudinal-stability
  uses 0.95 cutoff, intra-report uses the operational classifier.

Minor
- Impact Statement removed from export_v3.py SECTIONS list (IEEE Access
  Regular Papers do not have a standalone Impact Statement). The file
  itself is retained as an archived non-paper note for cover-letter /
  grant-report reuse, with a clear archive header.
- All 7 previously unused references ([27] dHash, [31][32] partner-
  signature mandates, [33] Taiwan partner rotation, [34] YOLO original,
  [35] VLM survey, [36] Mann-Whitney) are now cited in-text:
    [27] in Methodology III-E (dHash definition)
    [31][32][33] in Introduction (audit-quality regulation context)
    [34][35] in Methodology III-C/III-D
    [36] in Results IV-C (Mann-Whitney result)

Updated Script 24 to include cos = 0.9407 in COS_RULES so Table XI's
calibration-fold P5 row is computed from the same data file as the
other rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 12:23:03 +08:00
gbanyan 0ff1845b22 Paper A v3.4: resolve codex round-3 major-revision blockers
Three blockers from codex gpt-5.4 round-3 review (codex_review_gpt54_v3_3.md):

B1 Classifier vs three-method threshold mismatch
  - Methodology III-L rewritten to make explicit that the per-signature
    classifier and the accountant-level three-method convergence operate
    at different units (signature vs accountant) and are complementary
    rather than substitutable.
  - Add Results IV-G.3 + Table XII operational-threshold sensitivity:
    cos>0.95 vs cos>0.945 shifts dual-rule capture by 1.19 pp on whole
    Firm A; ~5% of signatures flip at the Uncertain/Moderate boundary.

B2 Held-out validation false "within Wilson CI" claim
  - Script 24 recomputes both calibration-fold and held-out-fold rates
    with Wilson 95% CIs and a two-proportion z-test on each rule.
  - Table XI replaced with the proper fold-vs-fold comparison; prose
    in Results IV-G.2 and Discussion V-C corrected: extreme rules agree
    across folds (p>0.7); operational rules in the 85-95% band differ
    by 1-5 pp due to within-Firm-A heterogeneity (random 30% sample
    contained more high-replication C1 accountants), not generalization
    failure.
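
A fold-vs-fold comparison of this kind can be sketched with a pooled two-proportion z-test. This is a minimal illustration; whether Script 24 uses the pooled or unpooled standard error is an assumption, and `two_proportion_z` is a hypothetical name.

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both folds are all-success or all-failure; no detectable difference.
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided normal tail: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Toy counts (illustrative only, not values from Table XI).
z, p = two_proportion_z(50, 100, 60, 100)
print(f"z = {z:.3f}, p = {p:.3f}")
```

With fold sizes in the tens of thousands, differences of a percentage point or so can reach significance, so the test is best read next to the Wilson intervals.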

B3 Interview evidence reframed as practitioner knowledge
  - The Firm A "interviews" referenced throughout v3.3 are private,
    informal professional conversations, not structured research
    interviews. Reframed accordingly: all "interview*" references in
    abstract / intro / methodology / results / discussion / conclusion
    are replaced with "domain knowledge / industry-practice knowledge".
  - This avoids overclaiming methodological formality and removes the
    human-subjects research framing that triggered the ethics-statement
    requirement.
  - Section III-H four-pillar Firm A validation now stands on visual
    inspection, signature-level statistics, accountant-level GMM, and
    the three Section IV-H analyses, with practitioner knowledge as
    background context only.
  - New Section III-M ("Data Source and Firm Anonymization") covers
    MOPS public-data provenance, Firm A/B/C/D pseudonymization, and
    conflict-of-interest declaration.

Add signature_analysis/24_validation_recalibration.py for the recomputed
calib-vs-held-out z-tests and the classifier sensitivity analysis;
output in reports/validation_recalibration/.

Pending (not in this commit): abstract length (368 -> 250 words),
Impact Statement removal, BD/McCrary sensitivity reporting, full
reproducibility appendix, references cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 11:45:24 +08:00
gbanyan 5717d61dd4 Paper A v3.3: apply codex v3.2 peer-review fixes
Codex (gpt-5.4) second-round review recommended 'minor revision'. This
commit addresses all issues flagged in that review.

## Structural fixes

- dHash calibration inconsistency (codex #1, most important):
  Clarified in Section III-L that the <=5 and <=15 dHash cutoffs come
  from the whole-sample Firm A cosine-conditional dHash distribution
  (median=5, P95=15), not from the calibration-fold independent-minimum
  dHash distribution (median=2, P95=9) which we report elsewhere as
  descriptive anchors. Added explicit note about the two dHash
  conventions and their relationship.

- Section IV-H framing (codex #2):
  Renamed "Firm A Benchmark Validation: Threshold-Independent Evidence"
  to "Additional Firm A Benchmark Validation" and clarified in the
  section intro that H.1 uses a fixed 0.95 cutoff, H.2 is fully
  threshold-free, H.3 uses the calibrated classifier. H.3's concluding
  sentence now says "the substantive evidence lies in the cross-firm
  gap" rather than claiming the test is threshold-free.

- Table XVI 93,979 typo fixed (codex #3):
  Corrected to 84,354 total (83,970 same-firm + 384 mixed-firm).

- Held-out Firm A denominator 124+54=178 vs 180 (codex #4):
  Added explicit note that 2 CPAs were excluded due to disambiguation
  ties in the CPA registry.

- Table VIII duplication (codex #5):
  Removed the duplicate accountant-level-only Table VIII comment; the
  comprehensive cross-level Table VIII subsumes it. Text now says
  "accountant-level rows of Table VIII (below)".

- Anonymization broken in Tables XIV-XVI (codex #6):
  Replaced "Deloitte"/"KPMG"/"PwC"/"EY" with "Firm A"/"Firm B"/"Firm C"/
  "Firm D" across Tables XIV, XV, XVI. Table and caption language
  updated accordingly.

- Table X unit mismatch (codex #7):
  Dropped precision, recall, F1 columns. Table now reports FAR
  (against the inter-CPA negative anchor) with Wilson 95% CIs and FRR
  (against the byte-identical positive anchor). III-K and IV-G.1 text
  updated to justify the change.

## Sentence-level fixes

- "three independent statistical methods" in Methodology III-A ->
  "three methodologically distinct statistical methods".
- "three independent methods" in Conclusion -> "three methodologically
  distinct methods".
- Abstract "~0.006 converging" now explicitly acknowledges that
  BD/McCrary produces no significant accountant-level discontinuity.
- Conclusion: same acknowledgment added.
- Discussion limitation sentence "BD/McCrary should be interpreted at
  the accountant level for threshold-setting purposes" rewritten to
  reflect v3.3 result that BD/McCrary is a diagnostic, not a threshold
  estimator, at the accountant level.
- III-H "two analyses" -> "three analyses" (H.1 longitudinal stability,
  H.2 partner ranking, H.3 intra-report consistency).
- Related Work White 1982 overclaim rewritten: "consistent estimators
  of the pseudo-true parameter that minimizes KL divergence" replaces
  "guarantees asymptotic recovery".
- III-J "behavior is close to discrete" -> "practice is clustered".
- IV-D.2 pivot sentence "discreteness of individual behavior yields
  bimodality" -> "aggregation over signatures reveals clustered (though
  not sharply discrete) patterns".

Target journal remains IEEE Access. Output:
Paper_A_IEEE_Access_Draft_v3.docx (395 KB).

Codex v3.2 review saved to paper/codex_review_gpt54_v3_2.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 02:32:17 +08:00
gbanyan 51d15b32a5 Paper A v3.2: partner v4 feedback integration (threshold-independent benchmark validation)
Partner v4 (signature_paper_draft_v4) proposed 3 substantive improvements;
partner confirmed the 2013-2019 restriction was an error (sample stays
2013-2023). The remaining suggestions are adopted with our own data.

## New scripts
- Script 22 (partner ranking): ranks all Big-4 auditor-years by mean
  max-cosine. Firm A occupies 95.9% of top-10% (base 27.8%), 3.5x
  concentration ratio. Stable across 2013-2023 (88-100% per year).
- Script 23 (intra-report consistency): for each 2-signer report,
  classify both signatures and check agreement. Firm A agrees 89.9%
  vs 62-67% at other Big-4. 87.5% of Firm A reports have BOTH signers
  non-hand-signed; only 4 reports (0.01%) have both hand-signed.
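
The top-decile concentration figure from Script 22 can be illustrated with a short sketch. The function and the toy rows are hypothetical; only the 95.9% and 27.8% shares come from the message above.

```python
def top_share_concentration(rows, firm, top_frac=0.10):
    """rows: (firm, score) per auditor-year. Returns (top share, base share, ratio)."""
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    n_top = max(1, round(top_frac * len(ranked)))
    top = ranked[:n_top]
    top_share = sum(f == firm for f, _ in top) / n_top
    base_share = sum(f == firm for f, _ in rows) / len(rows)
    return top_share, base_share, top_share / base_share

# Toy auditor-years (illustrative only).
rows = [("A", 0.99), ("A", 0.98), ("B", 0.90), ("B", 0.80), ("A", 0.70),
        ("B", 0.60), ("B", 0.50), ("B", 0.40), ("B", 0.30), ("B", 0.20)]
print(top_share_concentration(rows, "A", top_frac=0.2))

# Reported shares from the commit message: 95.9% of the top decile vs a 27.8% base.
print(round(0.959 / 0.278, 2))  # 3.45, reported above as 3.5x
```

The ratio compares a firm's share of the top of the ranking with its share of all auditor-years, so no similarity threshold enters the calculation.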

## New methodology additions
- III-G: explicit within-auditor-year no-mixing identification
  assumption (supported by Firm A interview evidence).
- III-H: 4th Firm A validation line: threshold-independent evidence
  from partner ranking + intra-report consistency.

## New results section IV-H (threshold-independent validation)
- IV-H.1: Firm A year-by-year cosine<0.95 rate. 2013-2019 mean=8.26%,
  2020-2023 mean=6.96%, 2023 lowest (3.75%). This stability contradicts
  the partner's hypothesis that 2020+ electronic systems increase
  heterogeneity; the data show the opposite (electronic systems are
  more consistent than physical stamping).
- IV-H.2: partner ranking top-K tables (pooled + year-by-year).
- IV-H.3: intra-report consistency per-firm table.

## Renumbering
- Section H (was Classification Results) -> I
- Section I (was Ablation) -> J
- Tables XIII-XVI new (yearly stability, top-K pooled, top-10% per-year,
  intra-report), XVII = classification (was XII), XVIII = ablation
  (was XIII).

These threshold-independent analyses address the codex review concern
about circular validation by providing benchmark evidence that does not
depend on any threshold calibrated to Firm A itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 01:59:49 +08:00
gbanyan 9d19ca5a31 Paper A v3.1: apply codex peer-review fixes + add Scripts 20/21
Major fixes per codex (gpt-5.4) review:

## Structural fixes
- Fixed three-method convergence overclaim: added Script 20 to run KDE
  antimode, BD/McCrary, and Beta mixture EM on accountant-level means.
  Accountant-level 1D convergence: KDE antimode=0.973, Beta-2=0.979,
  LogGMM-2=0.976 (within ~0.006). BD/McCrary finds no transition at
  accountant level (consistent with smooth clustering, not sharp
  discontinuity).
- Disambiguated Method 1: KDE crossover (between two labeled distributions,
  used at signature all-pairs level) vs KDE antimode (single-distribution
  local minimum, used at accountant level).
- Addressed Firm A circular validation: Script 21 adds CPA-level 70/30
  held-out fold. Calibration thresholds derived from 70% only; held-out
  rates reported with Wilson 95% CIs (e.g. cos>0.95 held-out=93.61%
  [93.21%-93.98%]).
- Fixed 139+32 vs 180: the split is 139/32 of 171 Firm A CPAs with >=10
  signatures (9 CPAs excluded for insufficient sample). Reconciled across
  intro, results, discussion, conclusion.
- Added document-level classification aggregation rule (worst-case signature
  label determines document label).
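
The CPA-level held-out fold described above can be sketched as follows. This is an illustration under assumptions: the seed, the nearest-rank percentile, and the record fields (`cpa`, `cos`) are hypothetical, and Script 21 may differ in all of them. The point it shows is that whole CPAs, not individual signatures, are assigned to a fold, and that the threshold is derived from the calibration fold only.

```python
import math
import random

def split_by_cpa(records, calib_frac=0.7, seed=0):
    """Assign whole CPAs (not individual signatures) to calibration or held-out folds."""
    cpa_ids = sorted({r["cpa"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(cpa_ids)
    n_calib = round(calib_frac * len(cpa_ids))
    calib_ids = set(cpa_ids[:n_calib])
    calib = [r for r in records if r["cpa"] in calib_ids]
    held_out = [r for r in records if r["cpa"] not in calib_ids]
    return calib, held_out

def percentile(values, q):
    """Nearest-rank percentile (q in 0..100); a simple stand-in for the real estimator."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def capture_rate(records, threshold):
    """Share of signatures whose cosine exceeds the threshold."""
    return sum(r["cos"] > threshold for r in records) / len(records)

# Toy records (illustrative only): 10 CPAs with 20 signatures each.
rng = random.Random(1)
records = [{"cpa": f"cpa{i}", "cos": rng.uniform(0.90, 1.0)}
           for i in range(10) for _ in range(20)]

calib, held_out = split_by_cpa(records)
threshold = percentile([r["cos"] for r in calib], 5)  # calibration-fold P5
print(round(capture_rate(held_out, threshold), 3))
```

Splitting at the CPA level keeps one accountant's signatures from appearing in both folds, so the held-out rate is not inflated by same-CPA near-duplicates.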

## Pixel-identity validation strengthened
- Script 21: built ~50,000-pair inter-CPA random negative anchor (replaces
  the original n=35 same-CPA low-similarity negative which had untenable
  Wilson CIs).
- Added Wilson 95% CI for every FAR in Table X.
- Proper EER interpolation (FAR=FRR point) in Table X.
- Softened "conservative recall" claim to "non-generalizable subset"
  language per codex feedback (byte-identical positives are a subset, not
  a representative positive class).
- Added inter-CPA stats: mean=0.762, P95=0.884, P99=0.913.

## Terminology & sentence-level fixes
- "statistically independent methods" -> "methodologically distinct methods"
  throughout (three diagnostics on the same sample are not independent).
- "formal bimodality check" -> "unimodality test" (dip test tests H0 of
  unimodality; rejection is consistent with but not a direct test of
  bimodality).
- "Firm A near-universally non-hand-signed" -> already corrected to
  "replication-dominated" in prior commit; this commit strengthens that
  framing with explicit held-out validation.
- "discrete-behavior regimes" -> "clustered accountant-level heterogeneity"
  (BD/McCrary non-transition at accountant level rules out sharp discrete
  boundaries; the defensible claim is clustered-but-smooth).
- Softened White 1982 quasi-MLE claim (no longer framed as a guarantee).
- Fixed VLM 1.2% FP overclaim (now acknowledges the 1.2% could be VLM FP
  or YOLO FN).
- Unified "310 byte-identical signatures" language across Abstract,
  Results, Discussion (previously alternated between pairs/signatures).
- Defined min_dhash_independent explicitly in Section III-G.
- Fixed table numbering (Table XI held-out added, classification moved to
  XII, ablation to XIII).
- Explained 84,386 vs 85,042 gap (656 docs have only one signature, no
  pairwise stat).
- Made Table IX explicitly a "consistency check" not "validation"; paired
  it with Table XI held-out rates as the genuine external check.
- Defined 0.941 threshold (calibration-fold Firm A cosine P5).
- Computed 0.945 Firm A rate exactly (94.52%) instead of interpolated.
- Fixed Ref [24] Qwen2.5-VL to full IEEE format (arXiv:2502.13923).

## New artifacts
- Script 20: accountant-level three-method threshold analysis
- Script 21: expanded validation (inter-CPA anchor, held-out Firm A 70/30)
- paper/codex_review_gpt54_v3.md: preserved review feedback

Output: Paper_A_IEEE_Access_Draft_v3.docx (391 KB, rebuilt from v3.1
markdown sources).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 01:11:51 +08:00
gbanyan 9b11f03548 Paper A v3: full rewrite for IEEE Access with three-method convergence
Major changes from v2:

Terminology:
- "digitally replicated" -> "non-hand-signed" throughout (per partner v3
  feedback and to avoid implicit accusation)
- "Firm A near-universal non-hand-signing" -> "replication-dominated"
  (per interview nuance: most but not all Firm A partners use replication)

Target journal: IEEE TAI -> IEEE Access (per NCKU CSIE list)

New methodological sections (III.G-III.L + IV.D-IV.G):
- Three convergent threshold methods (KDE antimode + Hartigan dip test /
  Burgstahler-Dichev McCrary / EM-fitted Beta mixture + logit-GMM
  robustness check)
- Explicit unit-of-analysis discussion (signature vs accountant)
- Accountant-level 2D Gaussian mixture (BIC-best K=3 found empirically)
- Pixel-identity validation anchor (no manual annotation needed)
- Low-similarity negative anchor + Firm A replication-dominated anchor

New empirical findings integrated:
- Firm A signature cosine UNIMODAL (dip p=0.17) - long left tail = minority
  hand-signers
- Full-sample cosine MULTIMODAL but not cleanly bimodal (BIC prefers 3-comp
  mixture) - signature-level is continuous quality spectrum
- Accountant-level mixture trimodal (C1 Deloitte-heavy 139/141,
  C2 other Big-4, C3 smaller firms). 2-comp crossings cos=0.945, dh=8.10
- Pixel-identity anchor (310 pairs) gives perfect recall at all cosine
  thresholds
- Firm A anchor rates: cos>0.95=92.5%, dual-rule cos>0.95 AND dh<=8=89.95%

New discussion section V.B: "Continuous-quality spectrum vs discrete-
behavior regimes" - the core interpretive contribution of v3.

References added: Hartigan & Hartigan 1985, Burgstahler & Dichev 1997,
McCrary 2008, Dempster-Laird-Rubin 1977, White 1982 (refs 37-41).

export_v3.py builds Paper_A_IEEE_Access_Draft_v3.docx (462 KB, +40% vs v2
from expanded methodology + results sections).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 00:14:47 +08:00
gbanyan 158f63efb2 Add Paper A drafts and docx export script
- export_paper_to_docx.py: build script combining paper_a_*.md sections into docx
- Paper_A_IEEE_TAI_Draft_20260403.docx: intermediate draft before AI review rounds
- Paper_A_IEEE_TAI_Draft_v2.docx: current draft after 3 AI reviews (GPT-5.4, Opus 4.6, Gemini 3 Pro) and Firm A recalibration

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 21:34:31 +08:00
gbanyan 939a348da4 Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification
Paper draft includes all sections (Abstract through Conclusion), 36 references,
and supporting scripts. Key methodology: Cosine similarity + dHash dual-method
verification with thresholds calibrated against known-replication firm (Firm A).

Includes:
- 8 section markdown files (paper_a_*.md)
- Ablation study script (ResNet-50 vs VGG-16 vs EfficientNet-B0)
- Recalibrated classification script (84,386 PDFs, 5-tier system)
- Figure generation and Word export scripts
- Citation renumbering script ([1]-[36])
- Signature analysis pipeline (12 steps)
- YOLO extraction scripts

Three rounds of AI review completed (GPT-5.4, Claude Opus 4.6, Gemini 3 Pro).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 23:05:33 +08:00