Commit Graph

9 Commits

Author SHA1 Message Date
gbanyan 6b64eabbfb Paper A v3.18.4: address codex GPT-5.5 round-18 self-comparing review findings
Codex round-18 (paper/codex_review_gpt55_v3_18_3.md) caught a falsified
provenance claim I introduced in v3.18.3 plus four cleaner narrative items
that survived the prior 17 rounds. Verdict was Minor Revision; this
commit closes all 5 actionable items.

- Harmonize signature_analysis/28_byte_identity_decomposition.py to use
  accountants.firm (joined on signatures.assigned_accountant) for Firm A
  membership, matching the convention in 24_validation_recalibration.py.
  Regenerated reports/byte_identity_decomp/byte_identity_decomposition.json.
  Cross-firm convergence now reports Firm A 49,389 / 55,922 = 88.32% and
  Non-Firm-A 27,595 / 65,514 = 42.12% (percentages unchanged at two
  decimal places; counts now match Table IX exactly).
- Replace the Section IV-H.2 reconciliation note. The previous note
  speculated that the one-record discrepancy was a snapshot/floating-point
  artifact, which codex round-18 falsified by direct DB queries: the real
  cause was that script 28 used signatures.excel_firm while Table IX uses
  accountants.firm. With script 28 now harmonized, Table IX and the
  cross-firm artifact agree exactly at 55,922; the new note documents the
  Firm A grouping convention plus the dHash-non-null filter.
- Replace residual "known-majority-positive" wording with
  "replication-dominated" in Introduction (contributions 4 and 6) and
  Methodology III-I (anchor-rationale paragraph).
- Correct Methodology III-G's auditor-year description: the per-signature
  best-match cosine that feeds each auditor-year mean is computed against
  the full same-CPA cross-year pool, not within-year only. The aggregation
  unit is within-year, but the underlying similarity statistic is not.
- Add the 145 / 50 / 180 / 35 Firm A byte-decomposition sentence to
  Results IV-F.1 with explicit pointer to script 28 and the JSON artifact;
  this resolves the round-18 finding that several manuscript locations
  cited IV-F.1 for a decomposition that was not actually reported there.
- Rebuild Paper_A_IEEE_Access_Draft_v3.docx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 20:59:07 +08:00
gbanyan f1c253768a Paper A v3.18.3: address codex GPT-5.5 round-17 self-comparing review findings
Codex round-17 (paper/codex_review_gpt55_v3_18_2.md) re-audited v3.18.2 and
flagged three new issues introduced by the v3.18.2 edits themselves plus
items it had partially RESOLVED but not fully cleaned up. Verdict still
Minor Revision; this commit closes the new findings.

- Fix Appendix B provenance paths: replace four fabricated paths
  (formal_statistical/*, deloitte_distribution/*, pdf_level/*, ablation/*)
  with the actual artifact paths verified in the local report tree.
- Acknowledge that the report tree is at /Volumes/NV2/PDF-Processing/...
  and reviewers should rebase to their own report root rather than rely on
  absolute paths.
- Remove residual "single dominant mechanism" wording from Methodology
  III-H (third primary evidence sentence) and Discussion V-C.
- Fix Methodology III-H Hartigan dip-test parenthetical: "p = 0.17 at
  n >= 10 signatures" wrongly attached the accountant-level filter to the
  signature-level dip; corrected to "p = 0.17, N = 60,448 Firm A
  signatures".
- Soften Introduction Firm A motivation: replace "widely recognized
  within the audit profession as making substantial use of non-hand-signing
  for the majority of its certifying partners" with a methodology-first
  framing that defers to the image evidence reported in the paper.
- Soften Methodology III-H "widely held within the audit profession"
  wording (kept as motivation, marked clearly as non-load-bearing in the
  next sentence).
- Reconcile 55,921 vs 55,922 Firm A cosine-only counts in Section IV-H.2:
  document explicitly that the one-record drift comes from successive DB
  snapshots used to materialize Table IX vs the new script-28 artifact;
  no rate at two decimal places is affected.
- Rebuild Paper_A_IEEE_Access_Draft_v3.docx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 20:45:54 +08:00
gbanyan 4bb7aa9189 Paper A v3.18.2: address codex GPT-5.5 round-16 Minor-Revision findings
Codex independent peer review (paper/codex_review_gpt55_v3_18_1.md) audited
empirical claims against scripts/JSON reports rather than rubber-stamping
prior Accept verdicts. Verdict: Minor Revision. This commit addresses every
flagged item.

- Soften mechanism-identification language (Results IV-D.1, Discussion B):
  per-signature cosine "fails to reject unimodality" rather than "reflects a
  single dominant generative mechanism"; framing tied to joint evidence.
- Replace overabsolute "single stored image" with multi-template phrasing
  in Introduction and Methodology III-A.
- Reframe Methodology III-H so practitioner knowledge is non-load-bearing;
  evidentiary basis is the paper's own image evidence.
- Fix stale section cross-references after the v3.18 retitling: IV-F.* ->
  IV-G.* in 11 locations across methodology and results.
- Fix 0.941 / 0.945 / 0.9407 wording in Methodology III-K to use the
  calibration-fold P5 = 0.9407 and the rounded sensitivity cut 0.945.
- Soften "sharp discontinuity" in Results IV-G.3 to "23-28 percentage-point
  gap consistent with firm-wide non-hand-signing practice".
- Soften Conclusion's "directly generalizable" with explicit conditions on
  analogous anchors and artifact-generation physics.
- Add Appendix B: table-to-script provenance map (15 manuscript tables
  mapped to generating scripts and JSON report artifacts).
- New script signature_analysis/28_byte_identity_decomposition.py produces
  reproducible artifacts for two previously-unverified claims:
  (a) 145 / 50 / 180 / 35 Firm A byte-identity decomposition (verified);
  (b) cross-firm dual-descriptor convergence -- corrected from the previous
      manuscript text "non-Firm-A 11.3% vs Firm A 58.7% (5x)" to the
      database-verified "non-Firm-A 42.12% vs Firm A 88.32% (~2.1x)".
- Clarify scripts 19 / 21 docstrings: legacy EER / FRR / Precision / F1
  helpers are retained for diagnostic use only and are NOT cited as
  biometric performance in the paper. Remove "interview evidence" wording.
- Rebuild Paper_A_IEEE_Access_Draft_v3.docx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 20:23:08 +08:00
gbanyan cb77f481ec Paper A v3.18.1: address remaining partner red-pen prose clarity items
Three targeted fixes per partner's red-pen audit (residue from v3.18 cleanup):

1. III-D 92.6% match rate -- partner red-circled the bare figure ("不太懂改善線").
   Add explicit explanation of the unmatched 7.4% (13,573 signatures): they
   could not be matched to a registered CPA name (deviation from two-signature
   layout, OCR-name mismatch) and are excluded from same-CPA pairwise analyses
   for definitional reasons, not discarded as noise.

2. III-I.1 Hartigan dip-test wording -- partner wrote "?所以為何?" next to the
   "rejecting unimodality is consistent with but does not directly establish
   bimodality" sentence. Replace with a direct three-line explanation: the
   test asks "is the distribution single-peaked?", a non-significant p means
   we cannot reject single-peak, a significant p means more than one peak
   (could be 2/3/...). Removes the partner's confusion without losing rigor.

3. IV-G validation lead-in -- partner wrote "不太懂為何陳述?" on the
   tangled "consistency check / threshold-free / operational classifier"
   triple. Rewrite as a three-bullet structure that names the *informative
   quantity* in each subsection (temporal trend / concentration ratio /
   cross-firm gap) and states explicitly why each is robust to cutoff choice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:48:59 +08:00
gbanyan 16e90bab20 Paper A v3.18: remove accountant-level + replication-dominated calibration + Gemini 2.5 Pro review minor fixes
Major changes (per partner red-pen + user decision):
- Delete entire accountant-level analysis (III.J, IV.E, Tables VI/VII/VIII,
  Fig 4) -- cross-year pooling assumption unjustified, removes the implicit
  "habitually stamps = always stamps" reading.
- Renumber sections III.J/K/L (was K/L/M) and IV.E/F/G/H/I (was F/G/H/I/J).
- Title: "Three-Method Convergent Thresholding" -> "Replication-Dominated
  Calibration" (the three diagnostics do NOT converge at signature level).
- Operational cosine cut anchored on whole-sample Firm A P7.5 (cos > 0.95).
- Three statistical diagnostics (Hartigan/Beta/BD-McCrary) reframed as
  descriptive characterisation, not threshold estimators.
- Firm A replication-dominated framing: 3 evidence strands -> 2.
- Discussion limitation list: drop accountant-level cross-year pooling and
  BD/McCrary diagnostic; add auditor-year longitudinal tracking as future work.
- Tone-shift: "we do not claim / do not derive" -> "we find / motivates".

Reference verification (independent web-search audit of all 41 refs):
- Fix [5] author hallucination: Hadjadj et al. -> Kao & Wen (real authors of
  Appl. Sci. 10:11:3716; report at paper/reference_verification_v3.md).
- Polish [16] [21] [22] [25] (year/volume/page-range/model-name).

Gemini 2.5 Pro peer review (Minor Revision verdict, A-F all positive):
- Neutralize script-path references in tables/appendix -> "supplementary
  materials".
- Move conflict-of-interest declaration from III-L to new Declarations
  section before References (paper_a_declarations_v3.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:43:09 +08:00
gbanyan 5717d61dd4 Paper A v3.3: apply codex v3.2 peer-review fixes
Codex (gpt-5.4) second-round review recommended 'minor revision'. This
commit addresses all issues flagged in that review.

## Structural fixes

- dHash calibration inconsistency (codex #1, most important):
  Clarified in Section III-L that the <=5 and <=15 dHash cutoffs come
  from the whole-sample Firm A cosine-conditional dHash distribution
  (median=5, P95=15), not from the calibration-fold independent-minimum
  dHash distribution (median=2, P95=9) which we report elsewhere as
  descriptive anchors. Added explicit note about the two dHash
  conventions and their relationship.

- Section IV-H framing (codex #2):
  Renamed "Firm A Benchmark Validation: Threshold-Independent Evidence"
  to "Additional Firm A Benchmark Validation" and clarified in the
  section intro that H.1 uses a fixed 0.95 cutoff, H.2 is fully
  threshold-free, H.3 uses the calibrated classifier. H.3's concluding
  sentence now says "the substantive evidence lies in the cross-firm
  gap" rather than claiming the test is threshold-free.

- Table XVI 93,979 typo fixed (codex #3):
  Corrected to 84,354 total (83,970 same-firm + 384 mixed-firm).

- Held-out Firm A denominator 124+54=178 vs 180 (codex #4):
  Added explicit note that 2 CPAs were excluded due to disambiguation
  ties in the CPA registry.

- Table VIII duplication (codex #5):
  Removed the duplicate accountant-level-only Table VIII comment; the
  comprehensive cross-level Table VIII subsumes it. Text now says
  "accountant-level rows of Table VIII (below)".

- Anonymization broken in Tables XIV-XVI (codex #6):
  Replaced "Deloitte"/"KPMG"/"PwC"/"EY" with "Firm A"/"Firm B"/"Firm C"/
  "Firm D" across Tables XIV, XV, XVI. Table and caption language
  updated accordingly.

- Table X unit mismatch (codex #7):
  Dropped precision, recall, F1 columns. Table now reports FAR
  (against the inter-CPA negative anchor) with Wilson 95% CIs and FRR
  (against the byte-identical positive anchor). III-K and IV-G.1 text
  updated to justify the change.

## Sentence-level fixes

- "three independent statistical methods" in Methodology III-A ->
  "three methodologically distinct statistical methods".
- "three independent methods" in Conclusion -> "three methodologically
  distinct methods".
- Abstract "~0.006 converging" now explicitly acknowledges that
  BD/McCrary produces no significant accountant-level discontinuity.
- Conclusion ditto.
- Discussion limitation sentence "BD/McCrary should be interpreted at
  the accountant level for threshold-setting purposes" rewritten to
  reflect v3.3 result that BD/McCrary is a diagnostic, not a threshold
  estimator, at the accountant level.
- III-H "two analyses" -> "three analyses" (H.1 longitudinal stability,
  H.2 partner ranking, H.3 intra-report consistency).
- Related Work White 1982 overclaim rewritten: "consistent estimators
  of the pseudo-true parameter that minimizes KL divergence" replaces
  "guarantees asymptotic recovery".
- III-J "behavior is close to discrete" -> "practice is clustered".
- IV-D.2 pivot sentence "discreteness of individual behavior yields
  bimodality" -> "aggregation over signatures reveals clustered (though
  not sharply discrete) patterns".

Target journal remains IEEE Access. Output:
Paper_A_IEEE_Access_Draft_v3.docx (395 KB).

Codex v3.2 review saved to paper/codex_review_gpt54_v3_2.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 02:32:17 +08:00
gbanyan 51d15b32a5 Paper A v3.2: partner v4 feedback integration (threshold-independent benchmark validation)
Partner v4 (signature_paper_draft_v4) proposed 3 substantive improvements;
partner confirmed the 2013-2019 restriction was an error (sample stays
2013-2023). The remaining suggestions are adopted with our own data.

## New scripts
- Script 22 (partner ranking): ranks all Big-4 auditor-years by mean
  max-cosine. Firm A occupies 95.9% of top-10% (base 27.8%), 3.5x
  concentration ratio. Stable across 2013-2023 (88-100% per year).
- Script 23 (intra-report consistency): for each 2-signer report,
  classify both signatures and check agreement. Firm A agrees 89.9%
  vs 62-67% at other Big-4. 87.5% Firm A reports have BOTH signers
  non-hand-signed; only 4 reports (0.01%) both hand-signed.

## New methodology additions
- III-G: explicit within-auditor-year no-mixing identification
  assumption (supported by Firm A interview evidence).
- III-H: 4th Firm A validation line: threshold-independent evidence
  from partner ranking + intra-report consistency.

## New results section IV-H (threshold-independent validation)
- IV-H.1: Firm A year-by-year cosine<0.95 rate. 2013-2019 mean=8.26%,
  2020-2023 mean=6.96%, 2023 lowest (3.75%). Stability contradicts
  partner's hypothesis that 2020+ electronic systems increase
  heterogeneity -- data shows opposite (electronic systems more
  consistent than physical stamping).
- IV-H.2: partner ranking top-K tables (pooled + year-by-year).
- IV-H.3: intra-report consistency per-firm table.

## Renumbering
- Section H (was Classification Results) -> I
- Section I (was Ablation) -> J
- Tables XIII-XVI new (yearly stability, top-K pooled, top-10% per-year,
  intra-report), XVII = classification (was XII), XVIII = ablation
  (was XIII).

These threshold-independent analyses address the codex review concern
about circular validation by providing benchmark evidence that does not
depend on any threshold calibrated to Firm A itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 01:59:49 +08:00
gbanyan 9d19ca5a31 Paper A v3.1: apply codex peer-review fixes + add Scripts 20/21
Major fixes per codex (gpt-5.4) review:

## Structural fixes
- Fixed three-method convergence overclaim: added Script 20 to run KDE
  antimode, BD/McCrary, and Beta mixture EM on accountant-level means.
  Accountant-level 1D convergence: KDE antimode=0.973, Beta-2=0.979,
  LogGMM-2=0.976 (within ~0.006). BD/McCrary finds no transition at
  accountant level (consistent with smooth clustering, not sharp
  discontinuity).
- Disambiguated Method 1: KDE crossover (between two labeled distributions,
  used at signature all-pairs level) vs KDE antimode (single-distribution
  local minimum, used at accountant level).
- Addressed Firm A circular validation: Script 21 adds CPA-level 70/30
  held-out fold. Calibration thresholds derived from 70% only; heldout
  rates reported with Wilson 95% CIs (e.g. cos>0.95 heldout=93.61%
  [93.21%-93.98%]).
- Fixed 139+32 vs 180: the split is 139/32 of 171 Firm A CPAs with >=10
  signatures (9 CPAs excluded for insufficient sample). Reconciled across
  intro, results, discussion, conclusion.
- Added document-level classification aggregation rule (worst-case signature
  label determines document label).

## Pixel-identity validation strengthened
- Script 21: built ~50,000-pair inter-CPA random negative anchor (replaces
  the original n=35 same-CPA low-similarity negative which had untenable
  Wilson CIs).
- Added Wilson 95% CI for every FAR in Table X.
- Proper EER interpolation (FAR=FRR point) in Table X.
- Softened "conservative recall" claim to "non-generalizable subset"
  language per codex feedback (byte-identical positives are a subset, not
  a representative positive class).
- Added inter-CPA stats: mean=0.762, P95=0.884, P99=0.913.

## Terminology & sentence-level fixes
- "statistically independent methods" -> "methodologically distinct methods"
  throughout (three diagnostics on the same sample are not independent).
- "formal bimodality check" -> "unimodality test" (dip test tests H0 of
  unimodality; rejection is consistent with but not a direct test of
  bimodality).
- "Firm A near-universally non-hand-signed" -> already corrected to
  "replication-dominated" in prior commit; this commit strengthens that
  framing with explicit held-out validation.
- "discrete-behavior regimes" -> "clustered accountant-level heterogeneity"
  (BD/McCrary non-transition at accountant level rules out sharp discrete
  boundaries; the defensible claim is clustered-but-smooth).
- Softened White 1982 quasi-MLE claim (no longer framed as a guarantee).
- Fixed VLM 1.2% FP overclaim (now acknowledges the 1.2% could be VLM FP
  or YOLO FN).
- Unified "310 byte-identical signatures" language across Abstract,
  Results, Discussion (previously alternated between pairs/signatures).
- Defined min_dhash_independent explicitly in Section III-G.
- Fixed table numbering (Table XI heldout added, classification moved to
  XII, ablation to XIII).
- Explained 84,386 vs 85,042 gap (656 docs have only one signature, no
  pairwise stat).
- Made Table IX explicitly a "consistency check" not "validation"; paired
  it with Table XI held-out rates as the genuine external check.
- Defined 0.941 threshold (calibration-fold Firm A cosine P5).
- Computed 0.945 Firm A rate exactly (94.52%) instead of interpolated.
- Fixed Ref [24] Qwen2.5-VL to full IEEE format (arXiv:2502.13923).

## New artifacts
- Script 20: accountant-level three-method threshold analysis
- Script 21: expanded validation (inter-CPA anchor, held-out Firm A 70/30)
- paper/codex_review_gpt54_v3.md: preserved review feedback

Output: Paper_A_IEEE_Access_Draft_v3.docx (391 KB, rebuilt from v3.1
markdown sources).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 01:11:51 +08:00
gbanyan 9b11f03548 Paper A v3: full rewrite for IEEE Access with three-method convergence
Major changes from v2:

Terminology:
- "digitally replicated" -> "non-hand-signed" throughout (per partner v3
  feedback and to avoid implicit accusation)
- "Firm A near-universal non-hand-signing" -> "replication-dominated"
  (per interview nuance: most but not all Firm A partners use replication)

Target journal: IEEE TAI -> IEEE Access (per NCKU CSIE list)

New methodological sections (III.G-III.L + IV.D-IV.G):
- Three convergent threshold methods (KDE antimode + Hartigan dip test /
  Burgstahler-Dichev McCrary / EM-fitted Beta mixture + logit-GMM
  robustness check)
- Explicit unit-of-analysis discussion (signature vs accountant)
- Accountant-level 2D Gaussian mixture (BIC-best K=3 found empirically)
- Pixel-identity validation anchor (no manual annotation needed)
- Low-similarity negative anchor + Firm A replication-dominated anchor

New empirical findings integrated:
- Firm A signature cosine UNIMODAL (dip p=0.17) - long left tail = minority
  hand-signers
- Full-sample cosine MULTIMODAL but not cleanly bimodal (BIC prefers 3-comp
  mixture) - signature-level is continuous quality spectrum
- Accountant-level mixture trimodal (C1 Deloitte-heavy 139/141,
  C2 other Big-4, C3 smaller firms). 2-comp crossings cos=0.945, dh=8.10
- Pixel-identity anchor (310 pairs) gives perfect recall at all cosine
  thresholds
- Firm A anchor rates: cos>0.95=92.5%, dual-rule cos>0.95 AND dh<=8=89.95%

New discussion section V.B: "Continuous-quality spectrum vs discrete-
behavior regimes" - the core interpretive contribution of v3.

References added: Hartigan & Hartigan 1985, Burgstahler & Dichev 1997,
McCrary 2008, Dempster-Laird-Rubin 1977, White 1982 (refs 37-41).

export_v3.py builds Paper_A_IEEE_Access_Draft_v3.docx (462 KB, +40% vs v2
from expanded methodology + results sections).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 00:14:47 +08:00