af08391a68
Gemini 3.1 Pro round-19 (paper/gemini_review_v3_18_4.md) caught FOUR
serious issues that all 18 prior AI review rounds missed, including
fabricated rationalizations and a real statistical flaw. All four were
verified by direct DB / script inspection. Verdict: Major Revision; this
commit closes every flagged item.
Fabricated rationalization corrections (text only, numbers unchanged):
- Section IV-H "656 documents excluded" rewritten. Previous text claimed
the exclusion was because "single-signature documents have no same-CPA
pairwise comparison" -- a fabricated explanation that contradicts the
paper's cross-document matching methodology. The truth, verified
against signature_analysis/09_pdf_signature_verdict.py L44 (WHERE
s.is_valid = 1 AND s.assigned_accountant IS NOT NULL): the 656
documents are excluded because none of their detected signatures could
be matched to a registered CPA name (assigned_accountant IS NULL).
- Section IV-F.2 "two CPAs excluded for disambiguation ties" rewritten.
No disambiguation logic exists in script 24; the 178 vs 180 difference
comes from two registered Firm A partners being singletons in the
corpus (one signature each, so per-signature best-match cosine is
undefined and they do not appear in the matched-signature table that
feeds the 70/30 split).
- Appendix B Table XIII provenance corrected. The previous attribution
to 13_deloitte_distribution_analysis.py / accountant_similarity_analysis.json
was wrong: neither artifact has year_month grouping. New script
29_firm_a_yearly_distribution.py reproduces Table XIII exactly from
the database via accountants.firm + signatures.year_month grouping.
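The corrected IV-H explanation can be illustrated with a small sketch. It counts documents whose valid signatures all have assigned_accountant IS NULL, which is the condition described above. The `doc_id` column, the toy rows, and the helper name `count_unmatched_documents` are assumptions for illustration only; they are not taken from script 09.

```python
import sqlite3


def count_unmatched_documents(conn):
    """Count documents with valid signatures, none matched to a CPA."""
    row = conn.execute("""
        SELECT COUNT(*) FROM (
            SELECT s.doc_id
            FROM signatures s
            WHERE s.is_valid = 1
            GROUP BY s.doc_id
            -- IS NOT NULL evaluates to 0/1 in SQLite, so SUM counts matches
            HAVING SUM(s.assigned_accountant IS NOT NULL) = 0
        )
    """).fetchone()
    return row[0]


# Toy in-memory table; doc_id is a hypothetical document key.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE signatures '
             '(doc_id TEXT, is_valid INTEGER, assigned_accountant TEXT)')
conn.executemany('INSERT INTO signatures VALUES (?, ?, ?)', [
    ('d1', 1, 'cpa_1'), ('d1', 1, None),  # one matched signature: kept
    ('d2', 1, None), ('d2', 1, None),     # nothing matched: excluded
    ('d3', 1, 'cpa_2'),                   # single matched signature: kept
])
n_unmatched = count_unmatched_documents(conn)
conn.close()
```

In this toy data only d2 is excluded. d3 carries a single signature yet stays in, which is why the earlier "single-signature" explanation could not account for the exclusion.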
Statistical flaw corrections (numbers updated):
- Inter-CPA negative anchor rewritten in 21_expanded_validation.py. The
prior implementation drew 50,000 random cross-CPA pairs from a
LIMIT-3000 random subsample, reusing each signature ~33 times and
artificially tightening Wilson FAR confidence intervals on Table X.
The corrected implementation samples 50,000 i.i.d. pairs uniformly
across the full 168,755-signature matched corpus.
- Re-run script 21. Table X numbers are close to v3.18.4 but no longer
  rest on the inflated-precision artifact:
    cos > 0.837: FAR 0.2101 (was 0.2062), CI [0.2066, 0.2137]
    cos > 0.900: FAR 0.0250 (was 0.0233), CI [0.0237, 0.0264]
    cos > 0.945: FAR 0.0008 (unchanged at this resolution)
    cos > 0.950: FAR 0.0005 (was 0.0007), CI [0.0003, 0.0007]
    cos > 0.973: FAR 0.0002 (was 0.0003), CI [0.0001, 0.0004]
    cos > 0.979: FAR 0.0001 (was 0.0002), CI [0.0001, 0.0003]
- Inter-CPA cosine summary stats also updated:
    mean 0.763 (was 0.762)
    P95 0.886 (was 0.884)
    P99 0.915 (was 0.913)
    max 0.992 (was 0.988)
- Manuscript IV-F.1 prose updated to reflect the i.i.d. full-corpus
sampling.
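The sampling fix and the Wilson intervals above can be sketched as follows. This is a minimal illustration, not the code in 21_expanded_validation.py: the rejection-sampling helper, the toy `owners` list, and the count of 10,505 hits (assumed from 0.2101 x 50,000) are all hypothetical.

```python
import math
import random


def sample_cross_cpa_pairs(owners, n_pairs, rng):
    """Draw index pairs uniformly over the whole corpus, keeping cross-CPA ones."""
    n = len(owners)
    pairs = []
    while len(pairs) < n_pairs:
        i, j = rng.randrange(n), rng.randrange(n)
        if owners[i] != owners[j]:  # also rejects i == j
            pairs.append((i, j))
    return pairs


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k successes in n independent trials."""
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


rng = random.Random(0)
owners = [f'cpa_{k % 7}' for k in range(1000)]  # toy corpus, 7 hypothetical CPAs
pairs = sample_cross_cpa_pairs(owners, 500, rng)

# Assumed count: 0.2101 * 50,000 = 10,505 pairs above cos > 0.837.
lo, hi = wilson_interval(10505, 50000)
```

With that assumed count the interval comes out near the [0.2066, 0.2137] listed for cos > 0.837. The Wilson formula treats the n trials as independent, so heavy reuse of a 3,000-signature subsample makes the nominal n = 50,000 overstate the effective sample size.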
Rebuild Paper_A_IEEE_Access_Draft_v3.docx.
Note: this is v3.19.0 because v3.19 closes both the fabricated
rationalizations and a genuine statistical flaw, not just provenance polish.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
124 lines
4.2 KiB
Python
#!/usr/bin/env python3
"""
Script 29: Firm A Per-Year Cosine Distribution (Table XIII)
============================================================
Generates the year-by-year Firm A per-signature best-match cosine
distribution reported as Table XIII in the manuscript. Codex / Gemini
round-19 review identified that this table previously had no dedicated
generating script (Appendix B incorrectly attributed it to Script 08,
which has no year_month extraction).

Definition:
  Firm A membership is via CPA registry (accountants.firm joined on
  signatures.assigned_accountant), matching the convention used by
  scripts 24 and 28.

  For each fiscal year (substr(year_month, 1, 4)):
    - N signatures with non-null max_similarity_to_same_accountant
    - mean of max_similarity_to_same_accountant (the per-signature
      best-match cosine)
    - share with max_similarity_to_same_accountant < 0.95 (the
      left-tail rate cited in Section IV-G.1)

Output:
  reports/firm_a_yearly/firm_a_yearly_distribution.json
  reports/firm_a_yearly/firm_a_yearly_distribution.md
"""

import json
import sqlite3
from datetime import datetime
from pathlib import Path

DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
           'firm_a_yearly')
OUT.mkdir(parents=True, exist_ok=True)

FIRM_A = '勤業眾信聯合'


def yearly_distribution(conn):
    cur = conn.cursor()
    cur.execute("""
        SELECT substr(s.year_month, 1, 4) AS year,
               COUNT(*) AS n_sigs,
               AVG(s.max_similarity_to_same_accountant) AS mean_cos,
               SUM(CASE
                     WHEN s.max_similarity_to_same_accountant < 0.95
                     THEN 1 ELSE 0
                   END) AS n_below_095
        FROM signatures s
        JOIN accountants a ON s.assigned_accountant = a.name
        WHERE a.firm = ?
          AND s.max_similarity_to_same_accountant IS NOT NULL
          AND s.year_month IS NOT NULL
        GROUP BY year
        ORDER BY year
    """, (FIRM_A,))

    rows = []
    for year, n_sigs, mean_cos, n_below in cur.fetchall():
        rows.append({
            'year': int(year),
            'n_signatures': n_sigs,
            'mean_best_match_cosine': round(mean_cos, 4),
            'n_below_cosine_095': n_below,
            'pct_below_cosine_095': round(100.0 * n_below / n_sigs, 2),
        })
    return rows


def write_markdown(payload, path):
    rows = payload['yearly_rows']
    lines = []
    lines.append('# Firm A Per-Year Cosine Distribution (Table XIII)')
    lines.append('')
    lines.append(f"Generated at: {payload['generated_at']}")
    lines.append('')
    lines.append('Firm A membership: CPA registry '
                 '(accountants.firm = "勤業眾信聯合"). Per-signature '
                 'best-match cosine = '
                 'signatures.max_similarity_to_same_accountant.')
    lines.append('')
    lines.append('| Year | N sigs | mean best-match cosine | % below 0.95 |')
    lines.append('|------|--------|------------------------|--------------|')
    for r in rows:
        lines.append(
            f"| {r['year']} | {r['n_signatures']:,} | "
            f"{r['mean_best_match_cosine']:.4f} | "
            f"{r['pct_below_cosine_095']:.2f}% |"
        )
    path.write_text('\n'.join(lines) + '\n', encoding='utf-8')


def main():
    conn = sqlite3.connect(DB)
    try:
        payload = {
            'generated_at': datetime.now().isoformat(timespec='seconds'),
            'database_path': DB,
            'firm_a_label': FIRM_A,
            'firm_a_membership_definition': (
                'CPA registry: accountants.firm joined on '
                'signatures.assigned_accountant'
            ),
            'cosine_metric': 'signatures.max_similarity_to_same_accountant',
            'yearly_rows': yearly_distribution(conn),
        }
    finally:
        conn.close()

    json_path = OUT / 'firm_a_yearly_distribution.json'
    json_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False),
                         encoding='utf-8')
    print(f'Wrote {json_path}')

    md_path = OUT / 'firm_a_yearly_distribution.md'
    write_markdown(payload, md_path)
    print(f'Wrote {md_path}')


if __name__ == '__main__':
    main()
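A quick way to sanity-check the yearly aggregation is to run the same query shape against a toy in-memory database. The sketch below is illustrative only: the 'YYYY-MM' format for year_month, the 'Firm A' label, and all rows are assumptions, and the two-table schema is reduced to the columns the query touches.

```python
import sqlite3

QUERY = """
    SELECT substr(s.year_month, 1, 4) AS year,
           COUNT(*) AS n_sigs,
           AVG(s.max_similarity_to_same_accountant) AS mean_cos,
           SUM(CASE WHEN s.max_similarity_to_same_accountant < 0.95
                    THEN 1 ELSE 0 END) AS n_below_095
    FROM signatures s
    JOIN accountants a ON s.assigned_accountant = a.name
    WHERE a.firm = ?
      AND s.max_similarity_to_same_accountant IS NOT NULL
      AND s.year_month IS NOT NULL
    GROUP BY year
    ORDER BY year
"""

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE accountants (name TEXT, firm TEXT);
    CREATE TABLE signatures (
        assigned_accountant TEXT,
        year_month TEXT,
        max_similarity_to_same_accountant REAL
    );
""")
conn.executemany('INSERT INTO accountants VALUES (?, ?)',
                 [('cpa_a', 'Firm A'), ('cpa_b', 'Firm B')])
conn.executemany('INSERT INTO signatures VALUES (?, ?, ?)', [
    ('cpa_a', '2019-03', 0.99),
    ('cpa_a', '2019-08', 0.91),  # counts toward the below-0.95 tail
    ('cpa_a', '2020-03', 0.97),
    ('cpa_a', '2020-04', None),  # no best match: filtered by IS NOT NULL
    ('cpa_b', '2019-03', 0.80),  # other firm: filtered by a.firm = ?
    (None, '2019-05', 0.93),     # unmatched signature: dropped by the join
])
rows = conn.execute(QUERY, ('Firm A',)).fetchall()
conn.close()
```

Only the three matched Firm A signatures with a non-null cosine survive, grouped into two years. Signatures with no assigned accountant never reach the aggregate because the inner join drops them.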