Paper A v3.19.0: address Gemini 3.1 Pro round-19 Major Revision findings

Gemini 3.1 Pro round-19 (paper/gemini_review_v3_18_4.md) caught four
serious issues that all 18 prior AI review rounds missed, including
fabricated rationalizations and a genuine statistical flaw. All four
were verified by direct DB / script inspection. Verdict: Major
Revision; this commit closes every flagged item.

Fabricated rationalization corrections (text only, numbers unchanged):

- Section IV-H "656 documents excluded" rewritten. Previous text claimed
  the exclusion was because "single-signature documents have no same-CPA
  pairwise comparison" -- a fabricated explanation that contradicts the
  paper's cross-document matching methodology. The truth, verified
  against signature_analysis/09_pdf_signature_verdict.py L44 (WHERE
  s.is_valid = 1 AND s.assigned_accountant IS NOT NULL): the 656
  documents are excluded because none of their detected signatures could
  be matched to a registered CPA name (assigned_accountant IS NULL).
- Section IV-F.2 "two CPAs excluded for disambiguation ties" rewritten.
  No disambiguation logic exists in script 24; the 178 vs 180 difference
  comes from two registered Firm A partners being singletons in the
  corpus (one signature each, so per-signature best-match cosine is
  undefined and they do not appear in the matched-signature table that
  feeds the 70/30 split).
- Appendix B Table XIII provenance corrected. The previous attribution
  to 13_deloitte_distribution_analysis.py / accountant_similarity_analysis.json
  was wrong: neither artifact has year_month grouping. New script
  29_firm_a_yearly_distribution.py reproduces Table XIII exactly from
  the database via accountants.firm + signatures.year_month grouping.
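
The corrected 656-document exclusion rule can be sketched on a toy
in-memory schema (a minimal illustration, not the project code: the
`pdf_path` document key and the toy rows are invented here; the
`is_valid` / `assigned_accountant` filter follows the script 09 WHERE
clause quoted above). A document is classifiable only if at least one
valid detected signature matched a registered CPA:

```python
import sqlite3

# Toy in-memory table mirroring the described signatures schema;
# pdf_path as the document key is an assumption for illustration.
conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE signatures (
    pdf_path TEXT, is_valid INTEGER, assigned_accountant TEXT)''')
conn.executemany(
    'INSERT INTO signatures VALUES (?, ?, ?)',
    [
        ('a.pdf', 1, 'CPA-1'),   # matched -> classifiable
        ('a.pdf', 1, None),      # unmatched co-signer; doc still classifiable
        ('b.pdf', 1, None),      # no matched signature at all -> excluded
        ('b.pdf', 1, None),
        ('c.pdf', 1, 'CPA-2'),   # matched -> classifiable
    ],
)

# Documents with at least one valid, CPA-matched signature
# (the script 09 filter: is_valid = 1 AND assigned_accountant IS NOT NULL).
classifiable = {row[0] for row in conn.execute(
    '''SELECT DISTINCT pdf_path FROM signatures
       WHERE is_valid = 1 AND assigned_accountant IS NOT NULL''')}
all_docs = {row[0] for row in conn.execute(
    'SELECT DISTINCT pdf_path FROM signatures')}
excluded = all_docs - classifiable
print(sorted(classifiable), sorted(excluded))  # ['a.pdf', 'c.pdf'] ['b.pdf']
```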

Statistical flaw corrections (numbers updated):

- Inter-CPA negative anchor rewritten in 21_expanded_validation.py. The
  prior implementation drew 50,000 random cross-CPA pairs from a
  LIMIT-3000 random subsample, reusing each signature ~33 times and
  artificially tightening Wilson FAR confidence intervals on Table X.
  The corrected implementation samples 50,000 i.i.d. pairs uniformly
  across the full 168,755-signature matched corpus.
- Re-run script 21. Table X numbers are close to v3.18.4 but no longer
  rest on the inflated-precision artifact:
    cos > 0.837: FAR 0.2101 (was 0.2062), CI [0.2066, 0.2137]
    cos > 0.900: FAR 0.0250 (was 0.0233), CI [0.0237, 0.0264]
    cos > 0.945: FAR 0.0008 (unchanged at this resolution)
    cos > 0.950: FAR 0.0005 (was 0.0007), CI [0.0003, 0.0007]
    cos > 0.973: FAR 0.0002 (was 0.0003), CI [0.0001, 0.0004]
    cos > 0.979: FAR 0.0001 (was 0.0002), CI [0.0001, 0.0003]
- Inter-CPA cosine summary stats also updated:
    mean 0.763 (was 0.762)
    P95 0.886 (was 0.884)
    P99 0.915 (was 0.913)
    max 0.992 (was 0.988)
- Manuscript IV-F.1 prose updated to reflect the i.i.d. full-corpus
  sampling.
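
As a sanity check on the updated Table X intervals, the Wilson interval
at the operational cut can be recomputed directly (a minimal sketch;
only the published FAR 0.0005 over n = 50,000 pairs, i.e. 25 false
accepts, is taken from this commit):

```python
import math

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Operational cut (cos > 0.950): 25 false accepts in 50,000 pairs.
lo, hi = wilson_ci(25, 50_000)
print(f'[{lo:.4f}, {hi:.4f}]')  # [0.0003, 0.0007], matching Table X
```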

Rebuild Paper_A_IEEE_Access_Draft_v3.docx.

Note: this is v3.19.0 because the v3.19 round closes both fabricated
rationalizations and a genuine statistical flaw, not just provenance
polish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:40:42 +08:00
parent 1e37d344ea
commit af08391a68
5 changed files with 192 additions and 32 deletions
Binary file not shown.
+1 -1
@@ -49,7 +49,7 @@ For reproducibility, the following table maps each numerical table in Section IV
 | Table X (cosine threshold sweep, FAR vs inter-CPA negatives) | `21_expanded_validation.py` | `reports/expanded_validation/expanded_validation_results.json` |
 | Table XI (held-out vs calibration Firm A capture rates) | `24_validation_recalibration.py` | `reports/validation_recalibration/validation_recalibration.json` |
 | Table XII (operational-cut sensitivity 0.95 vs 0.945) | `24_validation_recalibration.py` | `reports/validation_recalibration/validation_recalibration.json` |
-| Table XIII (Firm A per-year cosine distribution) | `13_deloitte_distribution_analysis.py` | derived from `reports/accountant_similarity_analysis.json` filtered to Firm A; figures in `reports/figures/` |
+| Table XIII (Firm A per-year cosine distribution) | `29_firm_a_yearly_distribution.py` | `reports/firm_a_yearly/firm_a_yearly_distribution.json` |
 | Tables XIV / XV (partner-level similarity ranking) | `22_partner_ranking.py` | `reports/partner_ranking/partner_ranking_results.json` |
 | Table XVI (intra-report classification agreement) | `23_intra_report_consistency.py` | `reports/intra_report/intra_report_results.json` |
 | Table XVII (document-level five-way classification) | `09_pdf_signature_verdict.py`; `12_generate_pdf_level_report.py` | `reports/pdf_signature_verdicts.json`; `reports/pdf_signature_verdict_report.md` (CSV / XLSX bulk reports also at `reports/`) |
+8 -8
@@ -150,7 +150,7 @@ We report three validation analyses corresponding to the anchors of Section III-
 Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the byte-identity positive anchor---a pair-level proof of image reuse that serves as conservative ground truth for non-hand-signed signatures, subject to the source-template edge case discussed in Section V-G.
 Within Firm A specifically, 145 of these byte-identical signatures are distributed across 50 distinct partners (of 180 registered Firm A partners), with 35 of the byte-identical pairs spanning different fiscal years; this Firm A decomposition is reproduced by `signature_analysis/28_byte_identity_decomposition.py` and reported in `reports/byte_identity_decomp/byte_identity_decomposition.json` (Appendix B).
-As the gold-negative anchor we sample 50,000 random cross-CPA signature pairs (inter-CPA cosine: mean $= 0.762$, $P_{95} = 0.884$, $P_{99} = 0.913$, max $= 0.988$).
+As the gold-negative anchor we sample 50,000 i.i.d. random cross-CPA signature pairs from the full 168,755-signature matched corpus (inter-CPA cosine: mean $= 0.763$, $P_{95} = 0.886$, $P_{99} = 0.915$, max $= 0.992$).
 Because the positive and negative anchor populations are constructed from different sampling units (byte-identical same-CPA pairs vs random inter-CPA pairs), their relative prevalence in the combined anchor set is arbitrary, and precision / $F_1$ / recall therefore have no meaningful population interpretation.
 We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor in Table X.
 The primary quantity reported by Table X is FAR: the probability that a random pair of signatures from *different* CPAs exceeds the candidate threshold.
@@ -159,12 +159,12 @@ We do not report an Equal Error Rate: EER is meaningful only when the positive a
 <!-- TABLE X: Cosine Threshold Sweep — FAR Against 50,000 Inter-CPA Negative Pairs
 | Threshold | FAR | FAR 95% Wilson CI |
 |-----------|-----|-------------------|
-| 0.837 (all-pairs KDE crossover) | 0.2062 | [0.2027, 0.2098] |
-| 0.900 | 0.0233 | [0.0221, 0.0247] |
+| 0.837 (all-pairs KDE crossover) | 0.2101 | [0.2066, 0.2137] |
+| 0.900 | 0.0250 | [0.0237, 0.0264] |
 | 0.945 (calibration-fold P5 rounded) | 0.0008 | [0.0006, 0.0011] |
-| 0.950 (whole-sample Firm A P7.5; operational cut) | 0.0007 | [0.0005, 0.0009] |
-| 0.973 (signature-level Beta/KDE upper bound) | 0.0003 | [0.0002, 0.0004] |
-| 0.979 (signature-level Beta-2 forced-fit crossing) | 0.0002 | [0.0001, 0.0004] |
+| 0.950 (whole-sample Firm A P7.5; operational cut) | 0.0005 | [0.0003, 0.0007] |
+| 0.973 (signature-level Beta/KDE upper bound) | 0.0002 | [0.0001, 0.0004] |
+| 0.979 (signature-level Beta-2 forced-fit crossing) | 0.0001 | [0.0001, 0.0003] |
 Table note: We do not include FRR against the byte-identical positive anchor as a column here: the byte-identical subset has cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$ and carries no biometric information beyond verifying that the threshold does not exceed $1$. The conservative-subset FRR role of the byte-identical anchor is instead discussed qualitatively in Section V-F.
 -->
@@ -178,7 +178,7 @@ The very low FAR at the operational cut is therefore informative about specifici
 ### 2) Held-Out Firm A Validation (within-Firm-A sampling variance disclosure)
 We split Firm A CPAs randomly 70 / 30 at the CPA level into a calibration fold (124 CPAs, 45,116 signatures) and a held-out fold (54 CPAs, 15,332 signatures).
-The total of 178 Firm A CPAs differs from the 180 in the Firm A registry by two CPAs whose signatures could not be matched to a single assigned-accountant record because of disambiguation ties in the CPA registry and which we therefore exclude from both folds; this handling is made explicit here.
+The total of 178 Firm A CPAs differs from the 180 in the Firm A registry by two registered Firm A partners whose signatures in the corpus are singletons (only one signature each, so the per-signature best-match cosine is undefined and they do not appear in the same-CPA matched-signature table that script `24_validation_recalibration.py` reads); they are therefore not represented in either fold by construction rather than by an explicit exclusion rule.
 Thresholds are re-derived from calibration-fold percentiles only.
 Table XI reports both calibration-fold and held-out-fold capture rates with Wilson 95% CIs and a two-proportion $z$-test.
@@ -340,7 +340,7 @@ We note that this test uses the calibrated classifier of Section III-K rather th
 ## H. Classification Results
 Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents.
-The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents carry only a single detected signature, for which no same-CPA pairwise comparison and therefore no best-match cosine / min dHash statistic is available; those documents are excluded from the classification reported here.
+The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents have no signature whose extracted handwriting could be matched to a registered CPA name (every such signature has `assigned_accountant IS NULL` in the database, typically because the auditor's report page deviates from the standard two-signature layout or the OCRed printed CPA name was not present in the registry); the per-document classifier requires at least one CPA-matched signature so that a same-CPA best-match similarity exists, and these documents are therefore excluded from the classification reported here.
 We emphasize that the document-level proportions below reflect the *worst-case aggregation rule* of Section III-K: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts.
 Document-level rates therefore represent the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-G.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.
+60 -23
@@ -85,44 +85,78 @@ def load_signatures():
     return rows
 
-def load_feature_vectors_sample(n=2000):
-    """Load feature vectors for inter-CPA negative-anchor sampling."""
-    conn = sqlite3.connect(DB)
-    cur = conn.cursor()
-    cur.execute('''
-        SELECT signature_id, assigned_accountant, feature_vector
-        FROM signatures
-        WHERE feature_vector IS NOT NULL
-          AND assigned_accountant IS NOT NULL
-        ORDER BY RANDOM()
-        LIMIT ?
-    ''', (n,))
-    rows = cur.fetchall()
-    conn.close()
-    out = []
-    for r in rows:
-        vec = np.frombuffer(r[2], dtype=np.float32)
-        out.append({'sig_id': r[0], 'accountant': r[1], 'feature': vec})
-    return out
+def load_signature_ids_for_negative_pool(seed=SEED):
+    """Load lightweight (sig_id, accountant) pool from the entire matched
+    corpus. Per Gemini round-19 review, the prior implementation drew
+    50,000 inter-CPA pairs from a tiny LIMIT-3000 random subset, reusing
+    each signature ~33 times and artificially tightening Wilson FAR CIs.
+    The corrected implementation samples pairs i.i.d. across the FULL
+    matched corpus (~168k signatures); only the unique signatures that
+    actually appear in the sampled pairs need feature vectors loaded.
+    """
+    conn = sqlite3.connect(DB)
+    cur = conn.cursor()
+    cur.execute('''
+        SELECT signature_id, assigned_accountant
+        FROM signatures
+        WHERE feature_vector IS NOT NULL
+          AND assigned_accountant IS NOT NULL
+    ''')
+    rows = cur.fetchall()
+    conn.close()
+    sig_ids = np.array([r[0] for r in rows], dtype=np.int64)
+    accts = np.array([r[1] for r in rows])
+    return sig_ids, accts
+
+def load_features_for_ids(sig_ids):
+    conn = sqlite3.connect(DB)
+    cur = conn.cursor()
+    placeholders = ','.join('?' * len(sig_ids))
+    cur.execute(
+        f'SELECT signature_id, feature_vector FROM signatures '
+        f'WHERE signature_id IN ({placeholders})',
+        [int(s) for s in sig_ids],
+    )
+    rows = cur.fetchall()
+    conn.close()
+    feat_by_id = {}
+    for sid, blob in rows:
+        feat_by_id[int(sid)] = np.frombuffer(blob, dtype=np.float32)
+    return feat_by_id
 
-def build_inter_cpa_negative(sample, n_pairs=N_INTER_PAIRS, seed=SEED):
-    """Sample random cross-CPA pairs; return their cosine similarities."""
+def build_inter_cpa_negative(sig_ids, accts, n_pairs=N_INTER_PAIRS, seed=SEED):
+    """Sample i.i.d. random cross-CPA pairs from the full matched corpus
+    and return their cosine similarities.
+    """
     rng = np.random.default_rng(seed)
-    n = len(sample)
-    feats = np.stack([s['feature'] for s in sample])
-    accts = np.array([s['accountant'] for s in sample])
-    sims = []
+    n = len(sig_ids)
+    pairs = []
     tries = 0
-    while len(sims) < n_pairs and tries < n_pairs * 10:
+    seen_pairs = set()
+    while len(pairs) < n_pairs and tries < n_pairs * 10:
         i = rng.integers(n)
         j = rng.integers(n)
         if i == j or accts[i] == accts[j]:
             tries += 1
             continue
-        sim = float(feats[i] @ feats[j])
-        sims.append(sim)
-        tries += 1
+        a, b = (i, j) if i < j else (j, i)
+        if (a, b) in seen_pairs:
+            tries += 1
+            continue
+        seen_pairs.add((a, b))
+        pairs.append((a, b))
+        tries += 1
+    needed_ids = sorted({int(sig_ids[i]) for pair in pairs for i in pair})
+    feat_by_id = load_features_for_ids(needed_ids)
+    sims = []
+    for i, j in pairs:
+        fi = feat_by_id[int(sig_ids[i])]
+        fj = feat_by_id[int(sig_ids[j])]
+        sims.append(float(fi @ fj))
     return np.array(sims)
@@ -212,9 +246,12 @@ def main():
     print(f'Firm A signatures: {int(firm_a_mask.sum()):,}')
 
     # --- (1) INTER-CPA NEGATIVE ANCHOR ---
-    print(f'\n[1] Building inter-CPA negative anchor ({N_INTER_PAIRS} pairs)...')
-    sample = load_feature_vectors_sample(n=3000)
-    inter_cos = build_inter_cpa_negative(sample, n_pairs=N_INTER_PAIRS)
+    print(f'\n[1] Building inter-CPA negative anchor ({N_INTER_PAIRS} '
+          f'i.i.d. pairs from full matched corpus)...')
+    pool_sig_ids, pool_accts = load_signature_ids_for_negative_pool()
+    print(f'    pool size: {len(pool_sig_ids):,} matched signatures')
+    inter_cos = build_inter_cpa_negative(pool_sig_ids, pool_accts,
+                                         n_pairs=N_INTER_PAIRS)
     print(f'    inter-CPA cos: mean={inter_cos.mean():.4f}, '
          f'p95={np.percentile(inter_cos, 95):.4f}, '
          f'p99={np.percentile(inter_cos, 99):.4f}, '
@@ -0,0 +1,123 @@
#!/usr/bin/env python3
"""
Script 29: Firm A Per-Year Cosine Distribution (Table XIII)
============================================================
Generates the year-by-year Firm A per-signature best-match cosine
distribution reported as Table XIII in the manuscript. Codex / Gemini
round-19 review identified that this table previously had no dedicated
generating script (Appendix B incorrectly attributed it to
13_deloitte_distribution_analysis.py, whose outputs have no
year_month grouping).
Definition:
Firm A membership is via CPA registry (accountants.firm joined on
signatures.assigned_accountant), matching the convention used by
scripts 24 and 28.
For each fiscal year (substr(year_month, 1, 4)):
- N signatures with non-null max_similarity_to_same_accountant
- mean of max_similarity_to_same_accountant (the per-signature
best-match cosine)
- share with max_similarity_to_same_accountant < 0.95 (the
left-tail rate cited in Section IV-G.1)
Output:
reports/firm_a_yearly/firm_a_yearly_distribution.json
reports/firm_a_yearly/firm_a_yearly_distribution.md
"""
import json
import sqlite3
from datetime import datetime
from pathlib import Path
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
'firm_a_yearly')
OUT.mkdir(parents=True, exist_ok=True)
FIRM_A = '勤業眾信聯合'
def yearly_distribution(conn):
cur = conn.cursor()
cur.execute("""
SELECT substr(s.year_month, 1, 4) AS year,
COUNT(*) AS n_sigs,
AVG(s.max_similarity_to_same_accountant) AS mean_cos,
SUM(CASE
WHEN s.max_similarity_to_same_accountant < 0.95
THEN 1 ELSE 0
END) AS n_below_095
FROM signatures s
JOIN accountants a ON s.assigned_accountant = a.name
WHERE a.firm = ?
AND s.max_similarity_to_same_accountant IS NOT NULL
AND s.year_month IS NOT NULL
GROUP BY year
ORDER BY year
""", (FIRM_A,))
rows = []
for year, n_sigs, mean_cos, n_below in cur.fetchall():
rows.append({
'year': int(year),
'n_signatures': n_sigs,
'mean_best_match_cosine': round(mean_cos, 4),
'n_below_cosine_095': n_below,
'pct_below_cosine_095': round(100.0 * n_below / n_sigs, 2),
})
return rows
def write_markdown(payload, path):
rows = payload['yearly_rows']
lines = []
lines.append('# Firm A Per-Year Cosine Distribution (Table XIII)')
lines.append('')
lines.append(f"Generated at: {payload['generated_at']}")
lines.append('')
lines.append('Firm A membership: CPA registry '
'(accountants.firm = "勤業眾信聯合"). Per-signature '
'best-match cosine = '
'signatures.max_similarity_to_same_accountant.')
lines.append('')
lines.append('| Year | N sigs | mean best-match cosine | % below 0.95 |')
lines.append('|------|--------|------------------------|--------------|')
for r in rows:
lines.append(
f"| {r['year']} | {r['n_signatures']:,} | "
f"{r['mean_best_match_cosine']:.4f} | "
f"{r['pct_below_cosine_095']:.2f}% |"
)
path.write_text('\n'.join(lines) + '\n', encoding='utf-8')
def main():
conn = sqlite3.connect(DB)
try:
payload = {
'generated_at': datetime.now().isoformat(timespec='seconds'),
'database_path': DB,
'firm_a_label': FIRM_A,
'firm_a_membership_definition': (
'CPA registry: accountants.firm joined on '
'signatures.assigned_accountant'
),
'cosine_metric': 'signatures.max_similarity_to_same_accountant',
'yearly_rows': yearly_distribution(conn),
}
finally:
conn.close()
json_path = OUT / 'firm_a_yearly_distribution.json'
json_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False),
encoding='utf-8')
print(f'Wrote {json_path}')
md_path = OUT / 'firm_a_yearly_distribution.md'
write_markdown(payload, md_path)
print(f'Wrote {md_path}')
if __name__ == '__main__':
main()