diff --git a/paper/Paper_A_IEEE_Access_Draft_v3.docx b/paper/Paper_A_IEEE_Access_Draft_v3.docx index 91f52a0..bb8067f 100644 Binary files a/paper/Paper_A_IEEE_Access_Draft_v3.docx and b/paper/Paper_A_IEEE_Access_Draft_v3.docx differ diff --git a/paper/paper_a_appendix_v3.md b/paper/paper_a_appendix_v3.md index f1384bc..ed7ae2e 100644 --- a/paper/paper_a_appendix_v3.md +++ b/paper/paper_a_appendix_v3.md @@ -49,7 +49,7 @@ For reproducibility, the following table maps each numerical table in Section IV | Table X (cosine threshold sweep, FAR vs inter-CPA negatives) | `21_expanded_validation.py` | `reports/expanded_validation/expanded_validation_results.json` | | Table XI (held-out vs calibration Firm A capture rates) | `24_validation_recalibration.py` | `reports/validation_recalibration/validation_recalibration.json` | | Table XII (operational-cut sensitivity 0.95 vs 0.945) | `24_validation_recalibration.py` | `reports/validation_recalibration/validation_recalibration.json` | -| Table XIII (Firm A per-year cosine distribution) | `13_deloitte_distribution_analysis.py` | derived from `reports/accountant_similarity_analysis.json` filtered to Firm A; figures in `reports/figures/` | +| Table XIII (Firm A per-year cosine distribution) | `29_firm_a_yearly_distribution.py` | `reports/firm_a_yearly/firm_a_yearly_distribution.json` | | Tables XIV / XV (partner-level similarity ranking) | `22_partner_ranking.py` | `reports/partner_ranking/partner_ranking_results.json` | | Table XVI (intra-report classification agreement) | `23_intra_report_consistency.py` | `reports/intra_report/intra_report_results.json` | | Table XVII (document-level five-way classification) | `09_pdf_signature_verdict.py`; `12_generate_pdf_level_report.py` | `reports/pdf_signature_verdicts.json`; `reports/pdf_signature_verdict_report.md` (CSV / XLSX bulk reports also at `reports/`) | diff --git a/paper/paper_a_results_v3.md b/paper/paper_a_results_v3.md index 17096ad..0728e84 100644 --- 
a/paper/paper_a_results_v3.md +++ b/paper/paper_a_results_v3.md @@ -150,7 +150,7 @@ We report three validation analyses corresponding to the anchors of Section III- Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the byte-identity positive anchor---a pair-level proof of image reuse that serves as conservative ground truth for non-hand-signed signatures, subject to the source-template edge case discussed in Section V-G. Within Firm A specifically, 145 of these byte-identical signatures are distributed across 50 distinct partners (of 180 registered Firm A partners), with 35 of the byte-identical pairs spanning different fiscal years; this Firm A decomposition is reproduced by `signature_analysis/28_byte_identity_decomposition.py` and reported in `reports/byte_identity_decomp/byte_identity_decomposition.json` (Appendix B). -As the gold-negative anchor we sample 50,000 random cross-CPA signature pairs (inter-CPA cosine: mean $= 0.762$, $P_{95} = 0.884$, $P_{99} = 0.913$, max $= 0.988$). +As the gold-negative anchor we sample 50,000 i.i.d. random cross-CPA signature pairs from the full 168,755-signature matched corpus (inter-CPA cosine: mean $= 0.763$, $P_{95} = 0.886$, $P_{99} = 0.915$, max $= 0.992$). Because the positive and negative anchor populations are constructed from different sampling units (byte-identical same-CPA pairs vs random inter-CPA pairs), their relative prevalence in the combined anchor set is arbitrary, and precision / $F_1$ / recall therefore have no meaningful population interpretation. We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor in Table X. The primary quantity reported by Table X is FAR: the probability that a random pair of signatures from *different* CPAs exceeds the candidate threshold. 
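The Wilson 95% confidence intervals attached to FAR throughout this section (Tables X and XI) follow the standard score interval; a minimal self-contained sketch of that formula, not the repository's actual implementation:

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n at ~95%.

    Unlike the Wald interval, it remains informative at k = 0, which is
    exactly the FAR = 0 regime reported at the operational cut.
    """
    if n == 0:
        return (0.0, 1.0)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)
```

For example, zero false accepts among the 50,000 inter-CPA negative pairs still yields a non-degenerate upper bound of roughly 7.7e-5, which is why a Wilson CI is quoted instead of a bare point estimate.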
@@ -159,12 +159,12 @@ We do not report an Equal Error Rate: EER is meaningful only when the positive a
@@ -178,7 +178,7 @@ The very low FAR at the operational cut is therefore informative about specifici
### 2) Held-Out Firm A Validation (within-Firm-A sampling variance disclosure)
We split Firm A CPAs randomly 70 / 30 at the CPA level into a calibration fold (124 CPAs, 45,116 signatures) and a held-out fold (54 CPAs, 15,332 signatures).
-The total of 178 Firm A CPAs differs from the 180 in the Firm A registry by two CPAs whose signatures could not be matched to a single assigned-accountant record because of disambiguation ties in the CPA registry and which we therefore exclude from both folds; this handling is made explicit here.
+The total of 178 Firm A CPAs differs from the 180 in the Firm A registry by two registered Firm A partners whose corpus signatures are singletons: with only one signature each, the per-signature best-match cosine is undefined, so they never enter the same-CPA matched-signature table that script `24_validation_recalibration.py` reads. They are therefore absent from both folds by construction rather than by an explicit exclusion rule.
Thresholds are re-derived from calibration-fold percentiles only.
Table XI reports both calibration-fold and held-out-fold capture rates with Wilson 95% CIs and a two-proportion $z$-test.
@@ -340,7 +340,7 @@ We note that this test uses the calibrated classifier of Section III-K rather th
## H. Classification Results
Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents.
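The two-proportion $z$-test reported alongside Table XI (calibration-fold vs held-out-fold capture rates) is the standard pooled-variance test; a sketch of that textbook formula, not the repository's code:

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """Pooled two-proportion z-test: z statistic and two-sided p-value
    under the normal approximation, comparing k1/n1 against k2/n2."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF
    pval = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, pval
```

A non-significant $z$ here supports the claim that the held-out capture rate is consistent with the calibration fold up to within-Firm-A sampling variance.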
-The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents carry only a single detected signature, for which no same-CPA pairwise comparison and therefore no best-match cosine / min dHash statistic is available; those documents are excluded from the classification reported here.
+The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents have no signature whose extracted handwriting could be matched to a registered CPA name: every such signature has `assigned_accountant IS NULL` in the database, typically because the auditor's report page deviates from the standard two-signature layout or because the OCRed printed CPA name was absent from the registry. The per-document classifier requires at least one CPA-matched signature so that a same-CPA best-match similarity exists; these documents are therefore excluded from the classification reported here.
We emphasize that the document-level proportions below reflect the *worst-case aggregation rule* of Section III-K: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts.
Document-level rates therefore represent the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are. The intra-report agreement analysis of Section IV-G.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.
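The worst-case aggregation rule of Section III-K reduces to taking the most replication-consistent verdict among a report's CPA-matched signatures. A few-line sketch; the label names below are hypothetical placeholders, since the paper's actual five-way label strings do not appear in this diff:

```python
# Hypothetical label names, ordered least to most replication-consistent;
# the manuscript's real five-way label set is not reproduced here.
SEVERITY = ['hand_signed', 'inconclusive', 'possibly_stamped',
            'stamped', 'byte_identical']

def document_verdict(signature_verdicts):
    """Worst-case aggregation: the document inherits the most
    replication-consistent verdict among its signature-level verdicts,
    so a single stamped co-signer labels the whole report."""
    return max(signature_verdicts, key=SEVERITY.index)
```

Under this rule a report with one `hand_signed` and one `stamped` signature is counted in the non-hand-signed document share, which is why Table XVI's intra-report agreement rates are needed to separate fully non-hand-signed reports from mixed ones.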
diff --git a/signature_analysis/21_expanded_validation.py b/signature_analysis/21_expanded_validation.py index eeb6e80..5aa37da 100644 --- a/signature_analysis/21_expanded_validation.py +++ b/signature_analysis/21_expanded_validation.py @@ -85,44 +85,78 @@ def load_signatures(): return rows -def load_feature_vectors_sample(n=2000): - """Load feature vectors for inter-CPA negative-anchor sampling.""" +def load_signature_ids_for_negative_pool(seed=SEED): + """Load lightweight (sig_id, accountant) pool from the entire matched + corpus. Per Gemini round-19 review, the prior implementation drew + 50,000 inter-CPA pairs from a tiny LIMIT-3000 random subset, reusing + each signature ~33 times and artificially tightening Wilson FAR CIs. + The corrected implementation samples pairs i.i.d. across the FULL + matched corpus (~168k signatures); only the unique signatures that + actually appear in the sampled pairs need feature vectors loaded. + """ conn = sqlite3.connect(DB) cur = conn.cursor() cur.execute(''' - SELECT signature_id, assigned_accountant, feature_vector + SELECT signature_id, assigned_accountant FROM signatures WHERE feature_vector IS NOT NULL AND assigned_accountant IS NOT NULL - ORDER BY RANDOM() - LIMIT ? - ''', (n,)) + ''') rows = cur.fetchall() conn.close() - out = [] - for r in rows: - vec = np.frombuffer(r[2], dtype=np.float32) - out.append({'sig_id': r[0], 'accountant': r[1], 'feature': vec}) - return out + sig_ids = np.array([r[0] for r in rows], dtype=np.int64) + accts = np.array([r[1] for r in rows]) + return sig_ids, accts -def build_inter_cpa_negative(sample, n_pairs=N_INTER_PAIRS, seed=SEED): - """Sample random cross-CPA pairs; return their cosine similarities.""" +def load_features_for_ids(sig_ids): + conn = sqlite3.connect(DB) + cur = conn.cursor() + placeholders = ','.join('?' 
* len(sig_ids)) + cur.execute( + f'SELECT signature_id, feature_vector FROM signatures ' + f'WHERE signature_id IN ({placeholders})', + [int(s) for s in sig_ids], + ) + rows = cur.fetchall() + conn.close() + feat_by_id = {} + for sid, blob in rows: + feat_by_id[int(sid)] = np.frombuffer(blob, dtype=np.float32) + return feat_by_id + + +def build_inter_cpa_negative(sig_ids, accts, n_pairs=N_INTER_PAIRS, seed=SEED): + """Sample i.i.d. random cross-CPA pairs from the full matched corpus + and return their cosine similarities. + """ rng = np.random.default_rng(seed) - n = len(sample) - feats = np.stack([s['feature'] for s in sample]) - accts = np.array([s['accountant'] for s in sample]) - sims = [] + n = len(sig_ids) + pairs = [] tries = 0 - while len(sims) < n_pairs and tries < n_pairs * 10: + seen_pairs = set() + while len(pairs) < n_pairs and tries < n_pairs * 10: i = rng.integers(n) j = rng.integers(n) if i == j or accts[i] == accts[j]: tries += 1 continue - sim = float(feats[i] @ feats[j]) - sims.append(sim) + a, b = (i, j) if i < j else (j, i) + if (a, b) in seen_pairs: + tries += 1 + continue + seen_pairs.add((a, b)) + pairs.append((a, b)) tries += 1 + + needed_ids = sorted({int(sig_ids[i]) for pair in pairs for i in pair}) + feat_by_id = load_features_for_ids(needed_ids) + + sims = [] + for i, j in pairs: + fi = feat_by_id[int(sig_ids[i])] + fj = feat_by_id[int(sig_ids[j])] + sims.append(float(fi @ fj)) return np.array(sims) @@ -212,9 +246,12 @@ def main(): print(f'Firm A signatures: {int(firm_a_mask.sum()):,}') # --- (1) INTER-CPA NEGATIVE ANCHOR --- - print(f'\n[1] Building inter-CPA negative anchor ({N_INTER_PAIRS} pairs)...') - sample = load_feature_vectors_sample(n=3000) - inter_cos = build_inter_cpa_negative(sample, n_pairs=N_INTER_PAIRS) + print(f'\n[1] Building inter-CPA negative anchor ({N_INTER_PAIRS} ' + f'i.i.d. 
pairs from full matched corpus)...') + pool_sig_ids, pool_accts = load_signature_ids_for_negative_pool() + print(f' pool size: {len(pool_sig_ids):,} matched signatures') + inter_cos = build_inter_cpa_negative(pool_sig_ids, pool_accts, + n_pairs=N_INTER_PAIRS) print(f' inter-CPA cos: mean={inter_cos.mean():.4f}, ' f'p95={np.percentile(inter_cos, 95):.4f}, ' f'p99={np.percentile(inter_cos, 99):.4f}, ' diff --git a/signature_analysis/29_firm_a_yearly_distribution.py b/signature_analysis/29_firm_a_yearly_distribution.py new file mode 100644 index 0000000..d43e935 --- /dev/null +++ b/signature_analysis/29_firm_a_yearly_distribution.py @@ -0,0 +1,123 @@ +#!/usr/bin/env python3 +""" +Script 29: Firm A Per-Year Cosine Distribution (Table XIII) +============================================================ +Generates the year-by-year Firm A per-signature best-match cosine +distribution reported as Table XIII in the manuscript. Codex / Gemini +round-19 review identified that this table previously had no dedicated +generating script (Appendix B incorrectly attributed it to Script 08, +which has no year_month extraction). + +Definition: + Firm A membership is via CPA registry (accountants.firm joined on + signatures.assigned_accountant), matching the convention used by + scripts 24 and 28. 
+ + For each fiscal year (substr(year_month, 1, 4)): + - N signatures with non-null max_similarity_to_same_accountant + - mean of max_similarity_to_same_accountant (the per-signature + best-match cosine) + - share with max_similarity_to_same_accountant < 0.95 (the + left-tail rate cited in Section IV-G.1) + +Output: + reports/firm_a_yearly/firm_a_yearly_distribution.json + reports/firm_a_yearly/firm_a_yearly_distribution.md +""" + +import json +import sqlite3 +from datetime import datetime +from pathlib import Path + +DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db' +OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/' + 'firm_a_yearly') +OUT.mkdir(parents=True, exist_ok=True) + +FIRM_A = '勤業眾信聯合' + + +def yearly_distribution(conn): + cur = conn.cursor() + cur.execute(""" + SELECT substr(s.year_month, 1, 4) AS year, + COUNT(*) AS n_sigs, + AVG(s.max_similarity_to_same_accountant) AS mean_cos, + SUM(CASE + WHEN s.max_similarity_to_same_accountant < 0.95 + THEN 1 ELSE 0 + END) AS n_below_095 + FROM signatures s + JOIN accountants a ON s.assigned_accountant = a.name + WHERE a.firm = ? + AND s.max_similarity_to_same_accountant IS NOT NULL + AND s.year_month IS NOT NULL + GROUP BY year + ORDER BY year + """, (FIRM_A,)) + + rows = [] + for year, n_sigs, mean_cos, n_below in cur.fetchall(): + rows.append({ + 'year': int(year), + 'n_signatures': n_sigs, + 'mean_best_match_cosine': round(mean_cos, 4), + 'n_below_cosine_095': n_below, + 'pct_below_cosine_095': round(100.0 * n_below / n_sigs, 2), + }) + return rows + + +def write_markdown(payload, path): + rows = payload['yearly_rows'] + lines = [] + lines.append('# Firm A Per-Year Cosine Distribution (Table XIII)') + lines.append('') + lines.append(f"Generated at: {payload['generated_at']}") + lines.append('') + lines.append('Firm A membership: CPA registry ' + '(accountants.firm = "勤業眾信聯合"). 
Per-signature ' + 'best-match cosine = ' + 'signatures.max_similarity_to_same_accountant.') + lines.append('') + lines.append('| Year | N sigs | mean best-match cosine | % below 0.95 |') + lines.append('|------|--------|------------------------|--------------|') + for r in rows: + lines.append( + f"| {r['year']} | {r['n_signatures']:,} | " + f"{r['mean_best_match_cosine']:.4f} | " + f"{r['pct_below_cosine_095']:.2f}% |" + ) + path.write_text('\n'.join(lines) + '\n', encoding='utf-8') + + +def main(): + conn = sqlite3.connect(DB) + try: + payload = { + 'generated_at': datetime.now().isoformat(timespec='seconds'), + 'database_path': DB, + 'firm_a_label': FIRM_A, + 'firm_a_membership_definition': ( + 'CPA registry: accountants.firm joined on ' + 'signatures.assigned_accountant' + ), + 'cosine_metric': 'signatures.max_similarity_to_same_accountant', + 'yearly_rows': yearly_distribution(conn), + } + finally: + conn.close() + + json_path = OUT / 'firm_a_yearly_distribution.json' + json_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False), + encoding='utf-8') + print(f'Wrote {json_path}') + + md_path = OUT / 'firm_a_yearly_distribution.md' + write_markdown(payload, md_path) + print(f'Wrote {md_path}') + + +if __name__ == '__main__': + main()
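One caveat worth flagging on `load_features_for_ids` in `21_expanded_validation.py` above: SQLite caps the number of `?` host parameters per statement (999 in older builds, 32,766 since SQLite 3.32), while the unique signature ids behind 50,000 sampled pairs can approach 100,000. A chunked variant stays under the cap in every build; this is a sketch against the same assumed `signatures(signature_id, feature_vector)` schema, not a drop-in from the repository:

```python
import sqlite3
import numpy as np

# 900 keeps each statement under SQLITE_MAX_VARIABLE_NUMBER even on
# pre-3.32 builds, where the default limit is 999.
CHUNK = 900

def load_features_for_ids_chunked(db_path, sig_ids):
    """Load feature vectors for an arbitrarily long id list by issuing
    one IN (...) query per chunk of at most CHUNK ids."""
    feat_by_id = {}
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.cursor()
        for start in range(0, len(sig_ids), CHUNK):
            chunk = [int(s) for s in sig_ids[start:start + CHUNK]]
            placeholders = ','.join('?' * len(chunk))
            cur.execute(
                f'SELECT signature_id, feature_vector FROM signatures '
                f'WHERE signature_id IN ({placeholders})', chunk)
            for sid, blob in cur.fetchall():
                feat_by_id[int(sid)] = np.frombuffer(blob, dtype=np.float32)
    finally:
        conn.close()
    return feat_by_id
```

The chunk size only affects the number of round trips, not the result, so any value below the build's parameter limit is equivalent.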