Paper A v3.19.0: address Gemini 3.1 Pro round-19 Major Revision findings

Gemini 3.1 Pro round-19 (paper/gemini_review_v3_18_4.md) caught FOUR
serious issues that all 18 prior AI review rounds missed, including
fabricated rationalizations and a real statistical flaw. All four
verified by direct DB / script inspection. Verdict: Major Revision;
this commit closes every flagged item.

Fabricated rationalization corrections (text only, numbers unchanged):

- Section IV-H "656 documents excluded" rewritten. Previous text
  claimed the exclusion was because "single-signature documents have
  no same-CPA pairwise comparison" -- a fabricated explanation that
  contradicts the paper's cross-document matching methodology. The
  truth, verified against
  signature_analysis/09_pdf_signature_verdict.py L44
  (WHERE s.is_valid = 1 AND s.assigned_accountant IS NOT NULL): the
  656 documents are excluded because none of their detected
  signatures could be matched to a registered CPA name
  (assigned_accountant IS NULL).
- Section IV-F.2 "two CPAs excluded for disambiguation ties"
  rewritten. No disambiguation logic exists in script 24; the 178 vs
  180 difference comes from two registered Firm A partners being
  singletons in the corpus (one signature each, so per-signature
  best-match cosine is undefined and they do not appear in the
  matched-signature table that feeds the 70/30 split).
- Appendix B Table XIII provenance corrected. The previous
  attribution to 13_deloitte_distribution_analysis.py /
  accountant_similarity_analysis.json was wrong: neither artifact has
  year_month grouping. New script 29_firm_a_yearly_distribution.py
  reproduces Table XIII exactly from the database via
  accountants.firm + signatures.year_month grouping.

Statistical flaw corrections (numbers updated):

- Inter-CPA negative anchor rewritten in 21_expanded_validation.py.
  The prior implementation drew 50,000 random cross-CPA pairs from a
  LIMIT-3000 random subsample, reusing each signature ~33 times and
  artificially tightening Wilson FAR confidence intervals on Table X.
  The corrected implementation samples 50,000 i.i.d. pairs uniformly
  across the full 168,755-signature matched corpus.
- Re-run script 21. Table X numbers are close to v3.18.4 but no
  longer rest on the inflated-precision artifact:
    cos > 0.837: FAR 0.2101 (was 0.2062), CI [0.2066, 0.2137]
    cos > 0.900: FAR 0.0250 (was 0.0233), CI [0.0237, 0.0264]
    cos > 0.945: FAR 0.0008 (unchanged at this resolution)
    cos > 0.950: FAR 0.0005 (was 0.0007), CI [0.0003, 0.0007]
    cos > 0.973: FAR 0.0002 (was 0.0003), CI [0.0001, 0.0004]
    cos > 0.979: FAR 0.0001 (was 0.0002), CI [0.0001, 0.0003]
- Inter-CPA cosine summary stats also updated:
    mean 0.763 (was 0.762)
    P95  0.886 (was 0.884)
    P99  0.915 (was 0.913)
    max  0.992 (was 0.988)
- Manuscript IV-F.1 prose updated to reflect the i.i.d. full-corpus
  sampling.

Rebuild Paper_A_IEEE_Access_Draft_v3.docx.

Note: this is v3.19.0 because v3.19 closes both fabrication and a
genuine statistical flaw, not just provenance polish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:40:42 +08:00
parent 1e37d344ea
commit af08391a68
5 changed files with 192 additions and 32 deletions
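
The Wilson interval cited for Table X treats the sampled pairs as independent Bernoulli trials, which is why heavy signature reuse made the old intervals look tighter than the data justified. The following is a minimal sketch, not the code in script 21: the function names are hypothetical, `z=1.96` assumes a 95% level (the commit message does not state one), and the count `k` is back-computed from the rounded FAR, so it is approximate.

```python
import math


def wilson_interval(k, n, z=1.96):
    """Two-sided Wilson score interval for k successes in n trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def mean_endpoint_reuse(n_pairs, pool_size):
    """Average number of sampled pairs each pooled signature appears in."""
    return 2 * n_pairs / pool_size


# cos > 0.837 row of Table X: FAR 0.2101 over 50,000 pairs.
# ASSUMPTION: k is reconstructed from the rounded FAR, not read from the data.
k = round(0.2101 * 50_000)
lo, hi = wilson_interval(k, 50_000)
print(f"Wilson CI: [{lo:.4f}, {hi:.4f}]")

# Each pair has two endpoints, so reuse scales as 2 * n_pairs / pool_size.
print(f"old pool (3,000):   {mean_endpoint_reuse(50_000, 3_000):.1f}")
print(f"new pool (168,755): {mean_endpoint_reuse(50_000, 168_755):.2f}")
```

Under these assumptions the sketch gives the interval reported for the first row and the "~33 times" reuse figure for the old 3,000-signature pool. With the full corpus, the average signature appears in fewer than one sampled pair.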
@@ -85,44 +85,78 @@ def load_signatures():
    return rows


-def load_feature_vectors_sample(n=2000):
-    """Load feature vectors for inter-CPA negative-anchor sampling."""
+def load_signature_ids_for_negative_pool(seed=SEED):
+    """Load lightweight (sig_id, accountant) pool from the entire matched
+    corpus. Per Gemini round-19 review, the prior implementation drew
+    50,000 inter-CPA pairs from a tiny LIMIT-3000 random subset, reusing
+    each signature ~33 times and artificially tightening Wilson FAR CIs.
+    The corrected implementation samples pairs i.i.d. across the FULL
+    matched corpus (~168k signatures); only the unique signatures that
+    actually appear in the sampled pairs need feature vectors loaded.
+    """
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    cur.execute('''
-        SELECT signature_id, assigned_accountant, feature_vector
+        SELECT signature_id, assigned_accountant
        FROM signatures
        WHERE feature_vector IS NOT NULL
          AND assigned_accountant IS NOT NULL
-        ORDER BY RANDOM()
-        LIMIT ?
-    ''', (n,))
+    ''')
    rows = cur.fetchall()
    conn.close()
-    out = []
-    for r in rows:
-        vec = np.frombuffer(r[2], dtype=np.float32)
-        out.append({'sig_id': r[0], 'accountant': r[1], 'feature': vec})
-    return out
+    sig_ids = np.array([r[0] for r in rows], dtype=np.int64)
+    accts = np.array([r[1] for r in rows])
+    return sig_ids, accts


-def build_inter_cpa_negative(sample, n_pairs=N_INTER_PAIRS, seed=SEED):
-    """Sample random cross-CPA pairs; return their cosine similarities."""
+def load_features_for_ids(sig_ids):
+    conn = sqlite3.connect(DB)
+    cur = conn.cursor()
+    placeholders = ','.join('?' * len(sig_ids))
+    cur.execute(
+        f'SELECT signature_id, feature_vector FROM signatures '
+        f'WHERE signature_id IN ({placeholders})',
+        [int(s) for s in sig_ids],
+    )
+    rows = cur.fetchall()
+    conn.close()
+    feat_by_id = {}
+    for sid, blob in rows:
+        feat_by_id[int(sid)] = np.frombuffer(blob, dtype=np.float32)
+    return feat_by_id
+
+
+def build_inter_cpa_negative(sig_ids, accts, n_pairs=N_INTER_PAIRS, seed=SEED):
+    """Sample i.i.d. random cross-CPA pairs from the full matched corpus
+    and return their cosine similarities.
+    """
    rng = np.random.default_rng(seed)
-    n = len(sample)
-    feats = np.stack([s['feature'] for s in sample])
-    accts = np.array([s['accountant'] for s in sample])
-    sims = []
+    n = len(sig_ids)
+    pairs = []
    tries = 0
-    while len(sims) < n_pairs and tries < n_pairs * 10:
+    seen_pairs = set()
+    while len(pairs) < n_pairs and tries < n_pairs * 10:
        i = rng.integers(n)
        j = rng.integers(n)
        if i == j or accts[i] == accts[j]:
            tries += 1
            continue
-        sim = float(feats[i] @ feats[j])
-        sims.append(sim)
+        a, b = (i, j) if i < j else (j, i)
+        if (a, b) in seen_pairs:
+            tries += 1
+            continue
+        seen_pairs.add((a, b))
+        pairs.append((a, b))
        tries += 1
+
+    needed_ids = sorted({int(sig_ids[i]) for pair in pairs for i in pair})
+    feat_by_id = load_features_for_ids(needed_ids)
+
+    sims = []
+    for i, j in pairs:
+        fi = feat_by_id[int(sig_ids[i])]
+        fj = feat_by_id[int(sig_ids[j])]
+        sims.append(float(fi @ fj))
    return np.array(sims)


@@ -212,9 +246,12 @@ def main():
    print(f'Firm A signatures: {int(firm_a_mask.sum()):,}')

    # --- (1) INTER-CPA NEGATIVE ANCHOR ---
-    print(f'\n[1] Building inter-CPA negative anchor ({N_INTER_PAIRS} pairs)...')
-    sample = load_feature_vectors_sample(n=3000)
-    inter_cos = build_inter_cpa_negative(sample, n_pairs=N_INTER_PAIRS)
+    print(f'\n[1] Building inter-CPA negative anchor ({N_INTER_PAIRS} '
+          f'i.i.d. pairs from full matched corpus)...')
+    pool_sig_ids, pool_accts = load_signature_ids_for_negative_pool()
+    print(f'  pool size: {len(pool_sig_ids):,} matched signatures')
+    inter_cos = build_inter_cpa_negative(pool_sig_ids, pool_accts,
+                                         n_pairs=N_INTER_PAIRS)
    print(f'  inter-CPA cos: mean={inter_cos.mean():.4f}, '
          f'p95={np.percentile(inter_cos, 95):.4f}, '
          f'p99={np.percentile(inter_cos, 99):.4f}, '
@@ -0,0 +1,123 @@
+#!/usr/bin/env python3
+"""
+Script 29: Firm A Per-Year Cosine Distribution (Table XIII)
+============================================================
+Generates the year-by-year Firm A per-signature best-match cosine
+distribution reported as Table XIII in the manuscript. Codex / Gemini
+round-19 review identified that this table previously had no dedicated
+generating script (Appendix B incorrectly attributed it to Script 08,
+which has no year_month extraction).
+
+Definition:
+  Firm A membership is via CPA registry (accountants.firm joined on
+  signatures.assigned_accountant), matching the convention used by
+  scripts 24 and 28.
+
+  For each fiscal year (substr(year_month, 1, 4)):
+    - N signatures with non-null max_similarity_to_same_accountant
+    - mean of max_similarity_to_same_accountant (the per-signature
+      best-match cosine)
+    - share with max_similarity_to_same_accountant < 0.95 (the
+      left-tail rate cited in Section IV-G.1)
+
+Output:
+  reports/firm_a_yearly/firm_a_yearly_distribution.json
+  reports/firm_a_yearly/firm_a_yearly_distribution.md
+"""
+
+import json
+import sqlite3
+from datetime import datetime
+from pathlib import Path
+
+DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
+OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
+           'firm_a_yearly')
+OUT.mkdir(parents=True, exist_ok=True)
+
+FIRM_A = '勤業眾信聯合'
+
+
+def yearly_distribution(conn):
+    cur = conn.cursor()
+    cur.execute("""
+        SELECT substr(s.year_month, 1, 4) AS year,
+               COUNT(*) AS n_sigs,
+               AVG(s.max_similarity_to_same_accountant) AS mean_cos,
+               SUM(CASE
+                     WHEN s.max_similarity_to_same_accountant < 0.95
+                     THEN 1 ELSE 0
+                   END) AS n_below_095
+        FROM signatures s
+        JOIN accountants a ON s.assigned_accountant = a.name
+        WHERE a.firm = ?
+          AND s.max_similarity_to_same_accountant IS NOT NULL
+          AND s.year_month IS NOT NULL
+        GROUP BY year
+        ORDER BY year
+    """, (FIRM_A,))
+
+    rows = []
+    for year, n_sigs, mean_cos, n_below in cur.fetchall():
+        rows.append({
+            'year': int(year),
+            'n_signatures': n_sigs,
+            'mean_best_match_cosine': round(mean_cos, 4),
+            'n_below_cosine_095': n_below,
+            'pct_below_cosine_095': round(100.0 * n_below / n_sigs, 2),
+        })
+    return rows
+
+
+def write_markdown(payload, path):
+    rows = payload['yearly_rows']
+    lines = []
+    lines.append('# Firm A Per-Year Cosine Distribution (Table XIII)')
+    lines.append('')
+    lines.append(f"Generated at: {payload['generated_at']}")
+    lines.append('')
+    lines.append('Firm A membership: CPA registry '
+                 '(accountants.firm = "勤業眾信聯合"). Per-signature '
+                 'best-match cosine = '
+                 'signatures.max_similarity_to_same_accountant.')
+    lines.append('')
+    lines.append('| Year | N sigs | mean best-match cosine | % below 0.95 |')
+    lines.append('|------|--------|------------------------|--------------|')
+    for r in rows:
+        lines.append(
+            f"| {r['year']} | {r['n_signatures']:,} | "
+            f"{r['mean_best_match_cosine']:.4f} | "
+            f"{r['pct_below_cosine_095']:.2f}% |"
+        )
+    path.write_text('\n'.join(lines) + '\n', encoding='utf-8')
+
+
+def main():
+    conn = sqlite3.connect(DB)
+    try:
+        payload = {
+            'generated_at': datetime.now().isoformat(timespec='seconds'),
+            'database_path': DB,
+            'firm_a_label': FIRM_A,
+            'firm_a_membership_definition': (
+                'CPA registry: accountants.firm joined on '
+                'signatures.assigned_accountant'
+            ),
+            'cosine_metric': 'signatures.max_similarity_to_same_accountant',
+            'yearly_rows': yearly_distribution(conn),
+        }
+    finally:
+        conn.close()
+
+    json_path = OUT / 'firm_a_yearly_distribution.json'
+    json_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False),
+                         encoding='utf-8')
+    print(f'Wrote {json_path}')
+
+    md_path = OUT / 'firm_a_yearly_distribution.md'
+    write_markdown(payload, md_path)
+    print(f'Wrote {md_path}')
+
+
+if __name__ == '__main__':
+    main()
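
To check the Table XIII query without the production database, the same aggregation can be run against a small in-memory fixture. This is a sketch under stated assumptions: the table layout is reduced to the columns the query touches, the `'YYYY-MM'` form of `year_month` is a guess consistent with `substr(year_month, 1, 4)`, and the firm label, CPA names, and similarity values are made up.

```python
import sqlite3

FIRM = 'Firm A'  # placeholder; the script uses the registry firm name

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE accountants (name TEXT PRIMARY KEY, firm TEXT);
    CREATE TABLE signatures (
        signature_id INTEGER PRIMARY KEY,
        assigned_accountant TEXT,
        year_month TEXT,
        max_similarity_to_same_accountant REAL
    );
''')
conn.executemany('INSERT INTO accountants VALUES (?, ?)',
                 [('cpa_1', FIRM), ('cpa_2', FIRM), ('cpa_3', 'Other')])
conn.executemany(
    'INSERT INTO signatures VALUES (?, ?, ?, ?)',
    [
        (1, 'cpa_1', '2019-03', 0.99),
        (2, 'cpa_1', '2019-06', 0.93),
        (3, 'cpa_2', '2020-03', 0.97),
        (4, 'cpa_2', '2020-04', None),  # no best match: filtered out
        (5, 'cpa_3', '2020-03', 0.80),  # other firm: filtered out
        (6, None, '2020-05', 0.90),     # unmatched: dropped by the JOIN
    ],
)

rows = conn.execute('''
    SELECT substr(s.year_month, 1, 4) AS year,
           COUNT(*) AS n_sigs,
           AVG(s.max_similarity_to_same_accountant) AS mean_cos,
           SUM(CASE WHEN s.max_similarity_to_same_accountant < 0.95
                    THEN 1 ELSE 0 END) AS n_below_095
    FROM signatures s
    JOIN accountants a ON s.assigned_accountant = a.name
    WHERE a.firm = ?
      AND s.max_similarity_to_same_accountant IS NOT NULL
      AND s.year_month IS NOT NULL
    GROUP BY year
    ORDER BY year
''', (FIRM,)).fetchall()
conn.close()

# Same rounding as yearly_distribution(): 4 dp for the mean, 2 dp for the share.
summary = [(int(y), n, round(m, 4), round(100.0 * b / n, 2))
           for y, n, m, b in rows]
print(summary)
```

The fixture exercises each filter once: a null best-match cosine, a signature from another firm, and a signature with no assigned accountant are all excluded from the per-year counts.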