pdf_signature_extraction/signature_analysis at bc36dcc2b64eb71ae6cac11e0ddf80cc55ce34ea - pdf_signature_extraction - Gitea

gbanyan/pdf_signature_extraction

gbanyan bc36dcc2b6 Add script 38: v4.0 convergence (CONVERGENCE_STRONG, three lenses agree)

Phase 1.6 (G2 path) script. Tests whether three INDEPENDENT
statistical approaches converge on the same Big-4 CPA ranking:

  1. K=3 GMM cluster posterior P_C1 (hand-leaning)
     -- from full Big-4 K=3 fit (Script 37 baseline).
  2. Reverse-anchor directional score
     -- non-Big-4 (n=249, mid/small firms only) as the
        reference Gaussian; -cos_left_tail_pct as score.
     -- Strict separation: no Big-4 CPA in the reference.
  3. Paper A v3.x operational rule, per-CPA hand_frac
     -- fraction of each CPA's signatures that fail
        (cos > 0.95 AND dh <= 5).
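
A minimal sketch of lens 3, assuming each signature arrives as a (cpa_id, cos, dh) tuple. The record layout and the names `hand_frac_per_cpa`, `COS_THRESHOLD`, and `DH_THRESHOLD` are illustrative assumptions, not taken from the scripts in this directory. Only the rule itself (cos > 0.95 AND dh <= 5) comes from the commit message.

```python
from collections import defaultdict

COS_THRESHOLD = 0.95  # from the commit message: cos > 0.95
DH_THRESHOLD = 5      # from the commit message: dh <= 5


def hand_frac_per_cpa(records):
    """Return {cpa_id: fraction of signatures failing the rule}.

    `records` is an iterable of (cpa_id, cos, dh) tuples; this layout
    is an assumption made for the sketch.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for cpa_id, cos, dh in records:
        totals[cpa_id] += 1
        passes_rule = cos > COS_THRESHOLD and dh <= DH_THRESHOLD
        if not passes_rule:
            failures[cpa_id] += 1
    return {cpa: failures[cpa] / totals[cpa] for cpa in totals}
```

A CPA whose signatures all pass the rule gets hand_frac = 0.0; one whose signatures all fail gets 1.0.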

Pairwise Spearman correlations:

  p_c1 vs paperA_hand_frac           rho = +0.9627  (p < 1e-248)
  reverse_anchor vs paperA_hand_frac rho = +0.8890  (p < 1e-149)
  p_c1 vs reverse_anchor             rho = +0.8794  (p < 1e-142)

Verdict: CONVERGENCE_STRONG (all 3 |rho| >= 0.7).
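
The verdict rule can be sketched as follows, assuming the three per-CPA scores are held as equal-length lists in a dict. This is a standard-library illustration of Spearman's rho (Pearson correlation of average ranks) plus the |rho| >= 0.7 check; a library routine such as `scipy.stats.spearmanr` would normally be used instead. The function names and the fallback label `NOT_STRONG` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def convergence_verdict(scores, threshold=0.7):
    """scores: {lens_name: [score per CPA, same CPA order in every list]}."""
    rhos = {
        (a, b): spearman_rho(scores[a], scores[b])
        for a, b in combinations(scores, 2)
    }
    strong = all(abs(rho) >= threshold for rho in rhos.values())
    # "NOT_STRONG" is a placeholder label, not a verdict name from the script.
    return rhos, "CONVERGENCE_STRONG" if strong else "NOT_STRONG"
```

Because the check uses |rho|, a lens that ranks CPAs in the opposite direction still counts as convergent as long as the association is strong.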

Per-firm consistency across lenses:

  Firm      n    C1%     C3%     E[P_C1]  E[rev]  E[hand]
  Firm A  171   0.00%  82.46%    0.007   -0.973   0.193
  KPMG    112   8.93%   0.00%    0.141   -0.820   0.696
  PwC     102  23.53%   0.98%    0.311   -0.767   0.790
  EY       52  11.54%   1.92%    0.241   -0.713   0.761

Consistent ordering across all three metrics:
  Firm A < KPMG < EY ~= PwC on hand-leaning
  (EY and PwC swap places under E[rev]; they are adjacent
  under E[P_C1] and E[hand]).
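
The per-firm ordering can be checked directly from the table. The sketch below hard-codes the three per-firm means reported in this commit message and sorts the firms by each one; the dict layout and the names `FIRM_MEANS` and `ordering` are illustrative only.

```python
# Per-firm means copied from the table in this commit message.
FIRM_MEANS = {
    "Firm A": {"E[P_C1]": 0.007, "E[rev]": -0.973, "E[hand]": 0.193},
    "KPMG":   {"E[P_C1]": 0.141, "E[rev]": -0.820, "E[hand]": 0.696},
    "PwC":    {"E[P_C1]": 0.311, "E[rev]": -0.767, "E[hand]": 0.790},
    "EY":     {"E[P_C1]": 0.241, "E[rev]": -0.713, "E[hand]": 0.761},
}


def ordering(metric):
    """Firms sorted from least to most hand-leaning under one metric."""
    return sorted(FIRM_MEANS, key=lambda firm: FIRM_MEANS[firm][metric])


ORDERINGS = {m: ordering(m) for m in ("E[P_C1]", "E[rev]", "E[hand]")}

for metric, firms in ORDERINGS.items():
    print(f"{metric:8s} {' < '.join(firms)}")
```

Firm A and KPMG occupy the first two positions under every metric; EY and PwC take the last two, in the order EY < PwC under E[P_C1] and E[hand] and PwC < EY under E[rev].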

Implication for v4.0: methodology paper now has THREE
independent lines of evidence converging on the same population
structure -- a much harder thing for a reviewer to dismiss
than any single lens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-12 15:03:55 +08:00

01_init_database.py

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

02_extract_features.py

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

03_similarity_analysis.py

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

04_generate_visual_report.py

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

05_extract_names_full.py

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

05_extract_names.py

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

07_cleanup_and_assign.py

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

08_accountant_similarity_analysis.py

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

09_pdf_signature_verdict.py

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

10_formal_statistical_analysis.py

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

11_compute_ssim_phash.py

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

12_generate_pdf_level_report.py

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

13_deloitte_distribution_analysis.py

Add Deloitte distribution & independent dHash analysis scripts

2026-04-20 21:34:24 +08:00

14_compute_independent_dhash.py

Add Deloitte distribution & independent dHash analysis scripts

2026-04-20 21:34:24 +08:00

15_hartigan_dip_test.py

Correct Firm A framing: replication-dominated, not pure

2026-04-20 21:57:16 +08:00

16_bd_mccrary_discontinuity.py

Correct Firm A framing: replication-dominated, not pure

2026-04-20 21:57:16 +08:00

17_beta_mixture_em.py

Add three-convergent-method threshold scripts + pixel-identity validation

2026-04-20 21:51:41 +08:00

18_accountant_mixture.py

Correct Firm A framing: replication-dominated, not pure

2026-04-20 21:57:16 +08:00

19_pixel_identity_validation.py

Paper A v3.18.2: address codex GPT-5.5 round-16 Minor-Revision findings

2026-04-27 20:23:08 +08:00

20_accountant_level_three_methods.py

Paper A v3.1: apply codex peer-review fixes + add Scripts 20/21

2026-04-21 01:11:51 +08:00

21_expanded_validation.py

Paper A v3.20.0: partner Jimmy 2026-04-27 review + DOCX rendering overhaul

2026-05-06 13:44:49 +08:00

22_partner_ranking.py

Paper A v3.2: partner v4 feedback integration (threshold-independent benchmark validation)

2026-04-21 01:59:49 +08:00

23_intra_report_consistency.py

Paper A v3.2: partner v4 feedback integration (threshold-independent benchmark validation)

2026-04-21 01:59:49 +08:00

24_validation_recalibration.py

Paper A v3.5: resolve codex round-4 residual issues

2026-04-21 12:23:03 +08:00

25_bd_mccrary_sensitivity.py

Paper A v3.7: demote BD/McCrary to density-smoothness diagnostic; add Appendix A

2026-04-21 14:32:50 +08:00

27_within_year_uniformity.py

Add script 27: within-auditor-year uniformity empirical check (A2 test)

2026-05-12 11:34:17 +08:00

28_byte_identity_decomposition.py

Paper A v3.18.4: address codex GPT-5.5 round-18 self-comparing review findings

2026-04-27 20:59:07 +08:00

29_firm_a_yearly_distribution.py

Paper A v3.19.0: address Gemini 3.1 Pro round-19 Major Revision findings

2026-04-27 21:40:42 +08:00

30_yearly_big4_comparison.py

Paper A v3.20.0: partner Jimmy 2026-04-27 review + DOCX rendering overhaul

2026-05-06 13:44:49 +08:00

31_within_year_ranking_robustness.py

Paper A v3.20.0: partner Jimmy 2026-04-27 review + DOCX rendering overhaul

2026-05-06 13:44:49 +08:00

32_non_firm_a_calibration.py

Add script 32: non-Firm-A calibration spike (verdict C with twist)

2026-05-12 12:05:18 +08:00

33_reverse_anchor_spike.py

Add script 33: reverse-anchor spike (PAPER_C_STRONG verdict)

2026-05-12 12:09:36 +08:00

34_big4_only_pooled_calibration.py

Add scripts 34 + 35: Big-4-only calibration foundation

2026-05-12 14:35:37 +08:00

35_big4_k3_cluster_names.py

Add scripts 34 + 35: Big-4-only calibration foundation

2026-05-12 14:35:37 +08:00

36_v4_calibration_and_loo.py

Add script 36: v4.0 calibration + LOOO validation (UNSTABLE verdict)

2026-05-12 14:54:54 +08:00

37_v4_k3_loo_check.py

Add script 37: K=3 LOOO check (P2_PARTIAL — v4.0 is salvageable with K=3)

2026-05-12 14:57:40 +08:00

38_v4_convergence_k3_and_reverse_anchor.py

Add script 38: v4.0 convergence (CONVERGENCE_STRONG, three lenses agree)

2026-05-12 15:03:55 +08:00

THRESHOLD_VALIDATION_OPTIONS.md

Add Deloitte distribution & independent dHash analysis scripts

2026-04-20 21:34:24 +08:00