Paper A v3.19.0: address Gemini 3.1 Pro round-19 Major Revision findings

Gemini 3.1 Pro round-19 (paper/gemini_review_v3_18_4.md) caught FOUR
serious issues that all 18 prior AI review rounds missed, including
fabricated rationalizations and a real statistical flaw. All four
verified by direct DB / script inspection. Verdict: Major Revision; this
commit closes every flagged item.

Fabricated rationalization corrections (text only, numbers unchanged):

- Section IV-H "656 documents excluded" rewritten. Previous text claimed
  the exclusion was because "single-signature documents have no same-CPA
  pairwise comparison" -- a fabricated explanation that contradicts the
  paper's cross-document matching methodology. The truth, verified
  against signature_analysis/09_pdf_signature_verdict.py L44 (WHERE
  s.is_valid = 1 AND s.assigned_accountant IS NOT NULL): the 656
  documents are excluded because none of their detected signatures could
  be matched to a registered CPA name (assigned_accountant IS NULL).
- Section IV-F.2 "two CPAs excluded for disambiguation ties" rewritten.
  No disambiguation logic exists in script 24; the 178 vs 180 difference
  comes from two registered Firm A partners being singletons in the
  corpus (one signature each, so per-signature best-match cosine is
  undefined and they do not appear in the matched-signature table that
  feeds the 70/30 split).
- Appendix B Table XIII provenance corrected. The previous attribution
  to 13_deloitte_distribution_analysis.py / accountant_similarity_analysis.json
  was wrong: neither artifact has year_month grouping. New script
  29_firm_a_yearly_distribution.py reproduces Table XIII exactly from
  the database via accountants.firm + signatures.year_month grouping.

Statistical flaw corrections (numbers updated):

- Inter-CPA negative anchor rewritten in 21_expanded_validation.py. The
  prior implementation drew 50,000 random cross-CPA pairs from a
  LIMIT-3000 random subsample, reusing each signature ~33 times and
  artificially tightening Wilson FAR confidence intervals on Table X.
  The corrected implementation samples 50,000 i.i.d. pairs uniformly
  across the full 168,755-signature matched corpus.
- Re-run script 21. Table X numbers are close to v3.18.4 but no longer
  rest on the inflated-precision artifact:
    cos > 0.837: FAR 0.2101 (was 0.2062), CI [0.2066, 0.2137]
    cos > 0.900: FAR 0.0250 (was 0.0233), CI [0.0237, 0.0264]
    cos > 0.945: FAR 0.0008 (unchanged at this resolution)
    cos > 0.950: FAR 0.0005 (was 0.0007), CI [0.0003, 0.0007]
    cos > 0.973: FAR 0.0002 (was 0.0003), CI [0.0001, 0.0004]
    cos > 0.979: FAR 0.0001 (was 0.0002), CI [0.0001, 0.0003]
- Inter-CPA cosine summary stats also updated:
    mean 0.763 (was 0.762)
    P95 0.886 (was 0.884)
    P99 0.915 (was 0.913)
    max 0.992 (was 0.988)
- Manuscript IV-F.1 prose updated to reflect the i.i.d. full-corpus
  sampling.

Rebuild Paper_A_IEEE_Access_Draft_v3.docx.

Note: this is v3.19.0 because v3.19 closes both fabrication and a
genuine statistical flaw, not just provenance polish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-27 21:40:42 +08:00
parent 1e37d344ea
commit af08391a68
5 changed files with 192 additions and 32 deletions
+60 -23
View File
@@ -85,44 +85,78 @@ def load_signatures():
return rows
def load_feature_vectors_sample(n=2000):
"""Load feature vectors for inter-CPA negative-anchor sampling."""
def load_signature_ids_for_negative_pool(seed=SEED):
"""Load lightweight (sig_id, accountant) pool from the entire matched
corpus. Per Gemini round-19 review, the prior implementation drew
50,000 inter-CPA pairs from a tiny LIMIT-3000 random subset, reusing
each signature ~33 times and artificially tightening Wilson FAR CIs.
The corrected implementation samples pairs i.i.d. across the FULL
matched corpus (~168k signatures); only the unique signatures that
actually appear in the sampled pairs need feature vectors loaded.
"""
conn = sqlite3.connect(DB)
cur = conn.cursor()
cur.execute('''
SELECT signature_id, assigned_accountant, feature_vector
SELECT signature_id, assigned_accountant
FROM signatures
WHERE feature_vector IS NOT NULL
AND assigned_accountant IS NOT NULL
ORDER BY RANDOM()
LIMIT ?
''', (n,))
''')
rows = cur.fetchall()
conn.close()
out = []
for r in rows:
vec = np.frombuffer(r[2], dtype=np.float32)
out.append({'sig_id': r[0], 'accountant': r[1], 'feature': vec})
return out
sig_ids = np.array([r[0] for r in rows], dtype=np.int64)
accts = np.array([r[1] for r in rows])
return sig_ids, accts
def build_inter_cpa_negative(sample, n_pairs=N_INTER_PAIRS, seed=SEED):
"""Sample random cross-CPA pairs; return their cosine similarities."""
def load_features_for_ids(sig_ids):
conn = sqlite3.connect(DB)
cur = conn.cursor()
placeholders = ','.join('?' * len(sig_ids))
cur.execute(
f'SELECT signature_id, feature_vector FROM signatures '
f'WHERE signature_id IN ({placeholders})',
[int(s) for s in sig_ids],
)
rows = cur.fetchall()
conn.close()
feat_by_id = {}
for sid, blob in rows:
feat_by_id[int(sid)] = np.frombuffer(blob, dtype=np.float32)
return feat_by_id
def build_inter_cpa_negative(sig_ids, accts, n_pairs=N_INTER_PAIRS, seed=SEED):
"""Sample i.i.d. random cross-CPA pairs from the full matched corpus
and return their cosine similarities.
"""
rng = np.random.default_rng(seed)
n = len(sample)
feats = np.stack([s['feature'] for s in sample])
accts = np.array([s['accountant'] for s in sample])
sims = []
n = len(sig_ids)
pairs = []
tries = 0
while len(sims) < n_pairs and tries < n_pairs * 10:
seen_pairs = set()
while len(pairs) < n_pairs and tries < n_pairs * 10:
i = rng.integers(n)
j = rng.integers(n)
if i == j or accts[i] == accts[j]:
tries += 1
continue
sim = float(feats[i] @ feats[j])
sims.append(sim)
a, b = (i, j) if i < j else (j, i)
if (a, b) in seen_pairs:
tries += 1
continue
seen_pairs.add((a, b))
pairs.append((a, b))
tries += 1
needed_ids = sorted({int(sig_ids[i]) for pair in pairs for i in pair})
feat_by_id = load_features_for_ids(needed_ids)
sims = []
for i, j in pairs:
fi = feat_by_id[int(sig_ids[i])]
fj = feat_by_id[int(sig_ids[j])]
sims.append(float(fi @ fj))
return np.array(sims)
@@ -212,9 +246,12 @@ def main():
print(f'Firm A signatures: {int(firm_a_mask.sum()):,}')
# --- (1) INTER-CPA NEGATIVE ANCHOR ---
print(f'\n[1] Building inter-CPA negative anchor ({N_INTER_PAIRS} pairs)...')
sample = load_feature_vectors_sample(n=3000)
inter_cos = build_inter_cpa_negative(sample, n_pairs=N_INTER_PAIRS)
print(f'\n[1] Building inter-CPA negative anchor ({N_INTER_PAIRS} '
f'i.i.d. pairs from full matched corpus)...')
pool_sig_ids, pool_accts = load_signature_ids_for_negative_pool()
print(f' pool size: {len(pool_sig_ids):,} matched signatures')
inter_cos = build_inter_cpa_negative(pool_sig_ids, pool_accts,
n_pairs=N_INTER_PAIRS)
print(f' inter-CPA cos: mean={inter_cos.mean():.4f}, '
f'p95={np.percentile(inter_cos, 95):.4f}, '
f'p99={np.percentile(inter_cos, 99):.4f}, '