Commit Graph

6 Commits

Author SHA1 Message Date
gbanyan 51d15b32a5 Paper A v3.2: partner v4 feedback integration (threshold-independent benchmark validation)
Partner v4 (signature_paper_draft_v4) proposed 3 substantive improvements;
partner confirmed the 2013-2019 restriction was an error (sample stays
2013-2023). The remaining suggestions are adopted with our own data.

## New scripts
- Script 22 (partner ranking): ranks all Big-4 auditor-years by mean
  max-cosine. Firm A occupies 95.9% of top-10% (base 27.8%), 3.5x
  concentration ratio. Stable across 2013-2023 (88-100% per year).
- Script 23 (intra-report consistency): for each 2-signer report,
  classify both signatures and check agreement. Firm A agrees 89.9%
  vs 62-67% at other Big-4. 87.5% Firm A reports have BOTH signers
  non-hand-signed; only 4 reports (0.01%) both hand-signed.

## New methodology additions
- III-G: explicit within-auditor-year no-mixing identification
  assumption (supported by Firm A interview evidence).
- III-H: 4th Firm A validation line: threshold-independent evidence
  from partner ranking + intra-report consistency.

## New results section IV-H (threshold-independent validation)
- IV-H.1: Firm A year-by-year cosine<0.95 rate. 2013-2019 mean=8.26%,
  2020-2023 mean=6.96%, 2023 lowest (3.75%). Stability contradicts
  partner's hypothesis that 2020+ electronic systems increase
  heterogeneity -- data shows opposite (electronic systems more
  consistent than physical stamping).
- IV-H.2: partner ranking top-K tables (pooled + year-by-year).
- IV-H.3: intra-report consistency per-firm table.

## Renumbering
- Section H (was Classification Results) -> I
- Section I (was Ablation) -> J
- Tables XIII-XVI new (yearly stability, top-K pooled, top-10% per-year,
  intra-report), XVII = classification (was XII), XVIII = ablation
  (was XIII).

These threshold-independent analyses address the codex review concern
about circular validation by providing benchmark evidence that does not
depend on any threshold calibrated to Firm A itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 01:59:49 +08:00
gbanyan 9d19ca5a31 Paper A v3.1: apply codex peer-review fixes + add Scripts 20/21
Major fixes per codex (gpt-5.4) review:

## Structural fixes
- Fixed three-method convergence overclaim: added Script 20 to run KDE
  antimode, BD/McCrary, and Beta mixture EM on accountant-level means.
  Accountant-level 1D convergence: KDE antimode=0.973, Beta-2=0.979,
  LogGMM-2=0.976 (within ~0.006). BD/McCrary finds no transition at
  accountant level (consistent with smooth clustering, not sharp
  discontinuity).
- Disambiguated Method 1: KDE crossover (between two labeled distributions,
  used at signature all-pairs level) vs KDE antimode (single-distribution
  local minimum, used at accountant level).
- Addressed Firm A circular validation: Script 21 adds CPA-level 70/30
  held-out fold. Calibration thresholds derived from 70% only; heldout
  rates reported with Wilson 95% CIs (e.g. cos>0.95 heldout=93.61%
  [93.21%-93.98%]).
- Fixed 139+32 vs 180: the split is 139/32 of 171 Firm A CPAs with >=10
  signatures (9 CPAs excluded for insufficient sample). Reconciled across
  intro, results, discussion, conclusion.
- Added document-level classification aggregation rule (worst-case signature
  label determines document label).

## Pixel-identity validation strengthened
- Script 21: built ~50,000-pair inter-CPA random negative anchor (replaces
  the original n=35 same-CPA low-similarity negative which had untenable
  Wilson CIs).
- Added Wilson 95% CI for every FAR in Table X.
- Proper EER interpolation (FAR=FRR point) in Table X.
- Softened "conservative recall" claim to "non-generalizable subset"
  language per codex feedback (byte-identical positives are a subset, not
  a representative positive class).
- Added inter-CPA stats: mean=0.762, P95=0.884, P99=0.913.

## Terminology & sentence-level fixes
- "statistically independent methods" -> "methodologically distinct methods"
  throughout (three diagnostics on the same sample are not independent).
- "formal bimodality check" -> "unimodality test" (dip test tests H0 of
  unimodality; rejection is consistent with but not a direct test of
  bimodality).
- "Firm A near-universally non-hand-signed" -> already corrected to
  "replication-dominated" in prior commit; this commit strengthens that
  framing with explicit held-out validation.
- "discrete-behavior regimes" -> "clustered accountant-level heterogeneity"
  (BD/McCrary non-transition at accountant level rules out sharp discrete
  boundaries; the defensible claim is clustered-but-smooth).
- Softened White 1982 quasi-MLE claim (no longer framed as a guarantee).
- Fixed VLM 1.2% FP overclaim (now acknowledges the 1.2% could be VLM FP
  or YOLO FN).
- Unified "310 byte-identical signatures" language across Abstract,
  Results, Discussion (previously alternated between pairs/signatures).
- Defined min_dhash_independent explicitly in Section III-G.
- Fixed table numbering (Table XI heldout added, classification moved to
  XII, ablation to XIII).
- Explained 84,386 vs 85,042 gap (656 docs have only one signature, no
  pairwise stat).
- Made Table IX explicitly a "consistency check" not "validation"; paired
  it with Table XI held-out rates as the genuine external check.
- Defined 0.941 threshold (calibration-fold Firm A cosine P5).
- Computed 0.945 Firm A rate exactly (94.52%) instead of interpolated.
- Fixed Ref [24] Qwen2.5-VL to full IEEE format (arXiv:2502.13923).

## New artifacts
- Script 20: accountant-level three-method threshold analysis
- Script 21: expanded validation (inter-CPA anchor, held-out Firm A 70/30)
- paper/codex_review_gpt54_v3.md: preserved review feedback

Output: Paper_A_IEEE_Access_Draft_v3.docx (391 KB, rebuilt from v3.1
markdown sources).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 01:11:51 +08:00
gbanyan 68689c9f9b Correct Firm A framing: replication-dominated, not pure
Interview evidence from multiple Firm A accountants confirms that MOST
use replication (stamping / firm-level e-signing) but a MINORITY may
still hand-sign. Firm A is therefore a "replication-dominated" population,
not a "pure" one. This framing is consistent with:

- 92.5% of Firm A signatures exceed cosine 0.95 (majority replication)
- The long left tail (~7%) captures the minority hand-signers, not scan
  noise or preprocessing artifacts
- Hartigan dip test: Firm A cosine unimodal long-tail (p=0.17)
- Accountant-level GMM: of 180 Firm A accountants, 139 cluster in C1
  (high-replication) and 32 in C2 (middle band = minority hand-signers)

Updates docstrings and report text in Scripts 15, 16, 18, 19 to match.
Partner v3's "near-universal non-hand-signing" language corrected.

Script 19 regenerated with the updated text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 21:57:16 +08:00
gbanyan fbfab1fa68 Add three-convergent-method threshold scripts + pixel-identity validation
Implements Partner v3's statistical rigor requirements at the level of
signature vs. accountant analysis units:

- Script 15 (Hartigan dip test): formal unimodality test via `diptest`.
  Result: Firm A cosine UNIMODAL (p=0.17, pure non-hand-signed population);
  full-sample cosine MULTIMODAL (p<0.001, mix of two regimes);
  accountant-level aggregates MULTIMODAL on both cos and dHash.

- Script 16 (Burgstahler-Dichev / McCrary): discretised Z-score transition
  detection. Firm A and full-sample cosine transitions at 0.985; dHash
  at 2.0.

- Script 17 (Beta mixture EM + logit-GMM): 2/3-component Beta via EM
  with MoM M-step, plus parallel Gaussian mixture on logit transform
  as White (1982) robustness check. Beta-3 BIC < Beta-2 BIC at signature
  level confirms 2-component is a forced fit -- supporting the pivot
  to accountant-level mixture.

- Script 18 (Accountant-level GMM): rebuilds the 2026-04-16 analysis
  that was done inline and not saved. BIC-best K=3 with components
  matching prior memory almost exactly: C1 (cos=0.983, dh=2.41, 20%,
  Deloitte 139/141), C2 (0.954, 6.99, 51%, KPMG/PwC/EY), C3 (0.928,
  11.17, 28%, small firms). 2-component natural thresholds:
  cos=0.9450, dh=8.10.

- Script 19 (Pixel-identity validation): no human annotation needed.
  Uses pixel_identical_to_closest (310 sigs) as gold positive and
  Firm A as anchor positive. Confirms Firm A cosine>0.95 = 92.51%
  (matches prior 2026-04-08 finding of 92.5%), dual rule
  cos>0.95 AND dhash_indep<=8 captures 89.95% of Firm A.

Python deps added: diptest, scikit-learn (installed into venv).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 21:51:41 +08:00
gbanyan a261a22bd2 Add Deloitte distribution & independent dHash analysis scripts
- Script 13: Firm A normality/multimodality analysis (Shapiro-Wilk, Anderson-Darling, KDE, per-accountant ANOVA, Beta/Gamma fitting)
- Script 14: Independent min-dHash computation across all pairs per accountant (not just cosine-nearest pair)
- THRESHOLD_VALIDATION_OPTIONS: 2026-01 discussion doc on threshold validation approaches
- .gitignore: exclude model weights, node artifacts, and xlsx data

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 21:34:24 +08:00
gbanyan 939a348da4 Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification
Paper draft includes all sections (Abstract through Conclusion), 36 references,
and supporting scripts. Key methodology: Cosine similarity + dHash dual-method
verification with thresholds calibrated against known-replication firm (Firm A).

Includes:
- 8 section markdown files (paper_a_*.md)
- Ablation study script (ResNet-50 vs VGG-16 vs EfficientNet-B0)
- Recalibrated classification script (84,386 PDFs, 5-tier system)
- Figure generation and Word export scripts
- Citation renumbering script ([1]-[36])
- Signature analysis pipeline (12 steps)
- YOLO extraction scripts

Three rounds of AI review completed (GPT-5.4, Claude Opus 4.6, Gemini 3 Pro).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 23:05:33 +08:00