Paper A v3.19.0: address Gemini 3.1 Pro round-19 Major Revision findings
Gemini 3.1 Pro round-19 (paper/gemini_review_v3_18_4.md) caught FOUR
serious issues that all 18 prior AI review rounds missed, including
fabricated rationalizations and a real statistical flaw. All four
verified by direct DB / script inspection. Verdict: Major Revision; this
commit closes every flagged item.
Fabricated rationalization corrections (text only, numbers unchanged):
- Section IV-H "656 documents excluded" rewritten. Previous text claimed
the exclusion was because "single-signature documents have no same-CPA
pairwise comparison" -- a fabricated explanation that contradicts the
paper's cross-document matching methodology. The truth, verified
against signature_analysis/09_pdf_signature_verdict.py L44 (WHERE
s.is_valid = 1 AND s.assigned_accountant IS NOT NULL): the 656
documents are excluded because none of their detected signatures could
be matched to a registered CPA name (assigned_accountant IS NULL).
- Section IV-F.2 "two CPAs excluded for disambiguation ties" rewritten.
No disambiguation logic exists in script 24; the 178 vs 180 difference
comes from two registered Firm A partners being singletons in the
corpus (one signature each, so per-signature best-match cosine is
undefined and they do not appear in the matched-signature table that
feeds the 70/30 split).
- Appendix B Table XIII provenance corrected. The previous attribution
to 13_deloitte_distribution_analysis.py / accountant_similarity_analysis.json
was wrong: neither artifact has year_month grouping. New script
29_firm_a_yearly_distribution.py reproduces Table XIII exactly from
the database via accountants.firm + signatures.year_month grouping.
Statistical flaw corrections (numbers updated):
- Inter-CPA negative anchor rewritten in 21_expanded_validation.py. The
prior implementation drew 50,000 random cross-CPA pairs from a
LIMIT-3000 random subsample, reusing each signature ~33 times and
artificially tightening Wilson FAR confidence intervals on Table X.
The corrected implementation samples 50,000 i.i.d. pairs uniformly
across the full 168,755-signature matched corpus.
- Re-run script 21. Table X numbers are close to v3.18.4 but no longer
rest on the inflated-precision artifact:
cos > 0.837: FAR 0.2101 (was 0.2062), CI [0.2066, 0.2137]
cos > 0.900: FAR 0.0250 (was 0.0233), CI [0.0237, 0.0264]
cos > 0.945: FAR 0.0008 (unchanged at this resolution)
cos > 0.950: FAR 0.0005 (was 0.0007), CI [0.0003, 0.0007]
cos > 0.973: FAR 0.0002 (was 0.0003), CI [0.0001, 0.0004]
cos > 0.979: FAR 0.0001 (was 0.0002), CI [0.0001, 0.0003]
- Inter-CPA cosine summary stats also updated:
mean 0.763 (was 0.762)
P95 0.886 (was 0.884)
P99 0.915 (was 0.913)
max 0.992 (was 0.988)
- Manuscript IV-F.1 prose updated to reflect the i.i.d. full-corpus
sampling.
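The corrected Table X intervals can be cross-checked by hand. A minimal sketch of the standard two-sided Wilson score interval (assuming z = 1.96 and 10,505 = 0.2101 × 50,000 exceedances), which reproduces the corrected 0.837-threshold row:

```python
import math

def wilson_ci(k, n, z=1.96):
    """Two-sided Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1.0 + z * z / n
    center = p + z * z / (2.0 * n)
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n))
    return (center - half) / denom, (center + half) / denom

# Corrected Table X first row: 10,505 / 50,000 exceedances -> FAR 0.2101
lo, hi = wilson_ci(10_505, 50_000)
print(f'[{lo:.4f}, {hi:.4f}]')  # → [0.2066, 0.2137]
```

Note that the interval's validity rests on the 50,000 pairs being independent draws, which is exactly what the i.i.d. full-corpus fix restores.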
Rebuild Paper_A_IEEE_Access_Draft_v3.docx.
Note: this is v3.19.0 because v3.19 closes both fabrication and a
genuine statistical flaw, not just provenance polish.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Paper_A_IEEE_Access_Draft_v3.docx: Binary file not shown.
@@ -49,7 +49,7 @@ For reproducibility, the following table maps each numerical table in Section IV
 | Table X (cosine threshold sweep, FAR vs inter-CPA negatives) | `21_expanded_validation.py` | `reports/expanded_validation/expanded_validation_results.json` |
 | Table XI (held-out vs calibration Firm A capture rates) | `24_validation_recalibration.py` | `reports/validation_recalibration/validation_recalibration.json` |
 | Table XII (operational-cut sensitivity 0.95 vs 0.945) | `24_validation_recalibration.py` | `reports/validation_recalibration/validation_recalibration.json` |
-| Table XIII (Firm A per-year cosine distribution) | `13_deloitte_distribution_analysis.py` | derived from `reports/accountant_similarity_analysis.json` filtered to Firm A; figures in `reports/figures/` |
+| Table XIII (Firm A per-year cosine distribution) | `29_firm_a_yearly_distribution.py` | `reports/firm_a_yearly/firm_a_yearly_distribution.json` |
 | Tables XIV / XV (partner-level similarity ranking) | `22_partner_ranking.py` | `reports/partner_ranking/partner_ranking_results.json` |
 | Table XVI (intra-report classification agreement) | `23_intra_report_consistency.py` | `reports/intra_report/intra_report_results.json` |
 | Table XVII (document-level five-way classification) | `09_pdf_signature_verdict.py`; `12_generate_pdf_level_report.py` | `reports/pdf_signature_verdicts.json`; `reports/pdf_signature_verdict_report.md` (CSV / XLSX bulk reports also at `reports/`) |
@@ -150,7 +150,7 @@ We report three validation analyses corresponding to the anchors of Section III-
 Of the 182,328 extracted signatures, 310 have a same-CPA nearest match that is byte-identical after crop and normalization (pixel-identical-to-closest = 1); these form the byte-identity positive anchor---a pair-level proof of image reuse that serves as conservative ground truth for non-hand-signed signatures, subject to the source-template edge case discussed in Section V-G.
 Within Firm A specifically, 145 of these byte-identical signatures are distributed across 50 distinct partners (of 180 registered Firm A partners), with 35 of the byte-identical pairs spanning different fiscal years; this Firm A decomposition is reproduced by `signature_analysis/28_byte_identity_decomposition.py` and reported in `reports/byte_identity_decomp/byte_identity_decomposition.json` (Appendix B).
-As the gold-negative anchor we sample 50,000 random cross-CPA signature pairs (inter-CPA cosine: mean $= 0.762$, $P_{95} = 0.884$, $P_{99} = 0.913$, max $= 0.988$).
+As the gold-negative anchor we sample 50,000 i.i.d. random cross-CPA signature pairs from the full 168,755-signature matched corpus (inter-CPA cosine: mean $= 0.763$, $P_{95} = 0.886$, $P_{99} = 0.915$, max $= 0.992$).
 Because the positive and negative anchor populations are constructed from different sampling units (byte-identical same-CPA pairs vs random inter-CPA pairs), their relative prevalence in the combined anchor set is arbitrary, and precision / $F_1$ / recall therefore have no meaningful population interpretation.
 We accordingly report FAR with Wilson 95% confidence intervals against the large inter-CPA negative anchor in Table X.
 The primary quantity reported by Table X is FAR: the probability that a random pair of signatures from *different* CPAs exceeds the candidate threshold.
@@ -159,12 +159,12 @@ We do not report an Equal Error Rate: EER is meaningful only when the positive a
 <!-- TABLE X: Cosine Threshold Sweep — FAR Against 50,000 Inter-CPA Negative Pairs
 | Threshold | FAR | FAR 95% Wilson CI |
 |-----------|-----|-------------------|
-| 0.837 (all-pairs KDE crossover) | 0.2062 | [0.2027, 0.2098] |
+| 0.837 (all-pairs KDE crossover) | 0.2101 | [0.2066, 0.2137] |
-| 0.900 | 0.0233 | [0.0221, 0.0247] |
+| 0.900 | 0.0250 | [0.0237, 0.0264] |
 | 0.945 (calibration-fold P5 rounded) | 0.0008 | [0.0006, 0.0011] |
-| 0.950 (whole-sample Firm A P7.5; operational cut) | 0.0007 | [0.0005, 0.0009] |
+| 0.950 (whole-sample Firm A P7.5; operational cut) | 0.0005 | [0.0003, 0.0007] |
-| 0.973 (signature-level Beta/KDE upper bound) | 0.0003 | [0.0002, 0.0004] |
+| 0.973 (signature-level Beta/KDE upper bound) | 0.0002 | [0.0001, 0.0004] |
-| 0.979 (signature-level Beta-2 forced-fit crossing) | 0.0002 | [0.0001, 0.0004] |
+| 0.979 (signature-level Beta-2 forced-fit crossing) | 0.0001 | [0.0001, 0.0003] |
 
 Table note: We do not include FRR against the byte-identical positive anchor as a column here: the byte-identical subset has cosine $\approx 1$ by construction, so FRR against that subset is trivially $0$ at every threshold below $1$ and carries no biometric information beyond verifying that the threshold does not exceed $1$. The conservative-subset FRR role of the byte-identical anchor is instead discussed qualitatively in Section V-F.
 -->
@@ -178,7 +178,7 @@ The very low FAR at the operational cut is therefore informative about specifici
 ### 2) Held-Out Firm A Validation (within-Firm-A sampling variance disclosure)
 
 We split Firm A CPAs randomly 70 / 30 at the CPA level into a calibration fold (124 CPAs, 45,116 signatures) and a held-out fold (54 CPAs, 15,332 signatures).
-The total of 178 Firm A CPAs differs from the 180 in the Firm A registry by two CPAs whose signatures could not be matched to a single assigned-accountant record because of disambiguation ties in the CPA registry and which we therefore exclude from both folds; this handling is made explicit here.
+The total of 178 Firm A CPAs differs from the 180 in the Firm A registry by two registered Firm A partners whose signatures in the corpus are singletons (only one signature each, so the per-signature best-match cosine is undefined and they do not appear in the same-CPA matched-signature table that script `24_validation_recalibration.py` reads); they are therefore not represented in either fold by construction rather than by an explicit exclusion rule.
 Thresholds are re-derived from calibration-fold percentiles only.
 Table XI reports both calibration-fold and held-out-fold capture rates with Wilson 95% CIs and a two-proportion $z$-test.
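The CPA-level fold construction described in this hunk can be sketched generically. The snippet below is a hedged illustration of group-level splitting with hypothetical arrays, not the code of `24_validation_recalibration.py`; the point is that splitting at the CPA (group) level keeps every writer's signatures in exactly one fold, avoiding identity leakage that a signature-level split would cause.

```python
import numpy as np

def split_cpas(cpa_ids, frac_cal=0.7, seed=0):
    """Group-level split: every signature of a CPA lands in exactly one
    fold, so the held-out fold contains no writer seen in calibration."""
    rng = np.random.default_rng(seed)
    unique = np.unique(cpa_ids)
    rng.shuffle(unique)
    n_cal = int(round(frac_cal * len(unique)))
    cal_set = set(unique[:n_cal].tolist())
    # Boolean mask over signatures: True -> calibration, False -> held out
    return np.array([c in cal_set for c in cpa_ids])

# Toy usage: 10 CPAs with 3 signatures each -> 7 vs 3 CPAs, 21 vs 9 sigs
cpa_ids = np.repeat(np.arange(10), 3)
cal_mask = split_cpas(cpa_ids)
print(int(cal_mask.sum()), int((~cal_mask).sum()))  # → 21 9
```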
@@ -340,7 +340,7 @@ We note that this test uses the calibrated classifier of Section III-K rather th
 ## H. Classification Results
 
 Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents.
-The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents carry only a single detected signature, for which no same-CPA pairwise comparison and therefore no best-match cosine / min dHash statistic is available; those documents are excluded from the classification reported here.
+The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents have no signature whose extracted handwriting could be matched to a registered CPA name (every such signature has `assigned_accountant IS NULL` in the database, typically because the auditor's report page deviates from the standard two-signature layout or the OCRed printed CPA name was not present in the registry); the per-document classifier requires at least one CPA-matched signature so that a same-CPA best-match similarity exists, and these documents are therefore excluded from the classification reported here.
 We emphasize that the document-level proportions below reflect the *worst-case aggregation rule* of Section III-K: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts.
 Document-level rates therefore represent the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-G.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.
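The worst-case aggregation rule discussed in this hunk amounts to a max over a severity ordering. A minimal sketch with hypothetical label names (the paper's actual five-way labels live in `09_pdf_signature_verdict.py`; the ordering below is only illustrative):

```python
# Hypothetical severity ordering, most replication-consistent last; this is
# NOT the paper's label set, only an illustration of worst-case max().
SEVERITY = {'hand_signed': 0, 'uncertain': 1, 'stamped': 2, 'replicated': 3}

def document_label(sig_verdicts):
    """Worst-case aggregation: the most replication-consistent
    signature-level verdict labels the whole report."""
    return max(sig_verdicts, key=SEVERITY.__getitem__)

# A mixed report (one hand-signed, one stamped co-signer) gets 'stamped'
print(document_label(['hand_signed', 'stamped']))  # → stamped
```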
21_expanded_validation.py:
@@ -85,44 +85,78 @@ def load_signatures():
     return rows
 
 
-def load_feature_vectors_sample(n=2000):
-    """Load feature vectors for inter-CPA negative-anchor sampling."""
+def load_signature_ids_for_negative_pool(seed=SEED):
+    """Load lightweight (sig_id, accountant) pool from the entire matched
+    corpus. Per Gemini round-19 review, the prior implementation drew
+    50,000 inter-CPA pairs from a tiny LIMIT-3000 random subset, reusing
+    each signature ~33 times and artificially tightening Wilson FAR CIs.
+    The corrected implementation samples pairs i.i.d. across the FULL
+    matched corpus (~168k signatures); only the unique signatures that
+    actually appear in the sampled pairs need feature vectors loaded.
+    """
     conn = sqlite3.connect(DB)
     cur = conn.cursor()
     cur.execute('''
-        SELECT signature_id, assigned_accountant, feature_vector
+        SELECT signature_id, assigned_accountant
         FROM signatures
         WHERE feature_vector IS NOT NULL
           AND assigned_accountant IS NOT NULL
-        ORDER BY RANDOM()
-        LIMIT ?
-    ''', (n,))
+    ''')
     rows = cur.fetchall()
     conn.close()
-    out = []
-    for r in rows:
-        vec = np.frombuffer(r[2], dtype=np.float32)
-        out.append({'sig_id': r[0], 'accountant': r[1], 'feature': vec})
-    return out
+    sig_ids = np.array([r[0] for r in rows], dtype=np.int64)
+    accts = np.array([r[1] for r in rows])
+    return sig_ids, accts
 
 
+def load_features_for_ids(sig_ids):
+    conn = sqlite3.connect(DB)
+    cur = conn.cursor()
+    placeholders = ','.join('?' * len(sig_ids))
+    cur.execute(
+        f'SELECT signature_id, feature_vector FROM signatures '
+        f'WHERE signature_id IN ({placeholders})',
+        [int(s) for s in sig_ids],
+    )
+    rows = cur.fetchall()
+    conn.close()
+    feat_by_id = {}
+    for sid, blob in rows:
+        feat_by_id[int(sid)] = np.frombuffer(blob, dtype=np.float32)
+    return feat_by_id
 
 
-def build_inter_cpa_negative(sample, n_pairs=N_INTER_PAIRS, seed=SEED):
-    """Sample random cross-CPA pairs; return their cosine similarities."""
+def build_inter_cpa_negative(sig_ids, accts, n_pairs=N_INTER_PAIRS, seed=SEED):
+    """Sample i.i.d. random cross-CPA pairs from the full matched corpus
+    and return their cosine similarities.
+    """
     rng = np.random.default_rng(seed)
-    n = len(sample)
-    feats = np.stack([s['feature'] for s in sample])
-    accts = np.array([s['accountant'] for s in sample])
-    sims = []
+    n = len(sig_ids)
+    pairs = []
     tries = 0
-    while len(sims) < n_pairs and tries < n_pairs * 10:
+    seen_pairs = set()
+    while len(pairs) < n_pairs and tries < n_pairs * 10:
         i = rng.integers(n)
         j = rng.integers(n)
         if i == j or accts[i] == accts[j]:
             tries += 1
             continue
-        sim = float(feats[i] @ feats[j])
-        sims.append(sim)
+        a, b = (i, j) if i < j else (j, i)
+        if (a, b) in seen_pairs:
+            tries += 1
+            continue
+        seen_pairs.add((a, b))
+        pairs.append((a, b))
         tries += 1
+
+    needed_ids = sorted({int(sig_ids[i]) for pair in pairs for i in pair})
+    feat_by_id = load_features_for_ids(needed_ids)
+
+    sims = []
+    for i, j in pairs:
+        fi = feat_by_id[int(sig_ids[i])]
+        fj = feat_by_id[int(sig_ids[j])]
+        sims.append(float(fi @ fj))
     return np.array(sims)
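Why the old LIMIT-3000 subsample tightened the intervals can be reproduced in a toy model. The sketch below replaces feature vectors with hypothetical scalar writer scores (a pair "exceeds the threshold" iff its two scores sum past a cut chosen for roughly 20% exceedance); it is not the paper's data, only an illustration that pairs drawn from a small reused pool are correlated, so the estimator's real spread exceeds the binomial spread a Wilson interval assumes.

```python
import numpy as np

rng = np.random.default_rng(42)
N_PAIRS, N_REPS, THRESH = 2_000, 300, 1.19  # cut chosen for ~20% exceedance

def far_estimate(pool):
    """FAR over N_PAIRS random pairs (i != j) drawn from `pool` with
    replacement; a pair 'matches' when its two scores sum past THRESH."""
    i = rng.integers(len(pool), size=N_PAIRS)
    j = rng.integers(len(pool), size=N_PAIRS)
    keep = i != j
    return float(np.mean(pool[i[keep]] + pool[j[keep]] > THRESH))

# Spread of the FAR estimator across replicates: large pool vs tiny pool
full = np.array([far_estimate(rng.standard_normal(10_000))
                 for _ in range(N_REPS)])
sub = np.array([far_estimate(rng.standard_normal(300))
                for _ in range(N_REPS)])
binom_sd = np.sqrt(0.2 * 0.8 / N_PAIRS)  # spread a Wilson/Wald CI assumes
print(f'binomial sd {binom_sd:.4f} | full-pool sd {full.std():.4f} '
      f'| 300-item-pool sd {sub.std():.4f}')
```

With the large pool each score is reused so rarely that the pairs are effectively independent and the empirical spread matches the binomial figure; with the 300-item pool each score is reused many times and the true spread is several times larger, which is exactly the anti-conservative-CI artifact the review flagged.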
@@ -212,9 +246,12 @@ def main():
     print(f'Firm A signatures: {int(firm_a_mask.sum()):,}')
 
     # --- (1) INTER-CPA NEGATIVE ANCHOR ---
-    print(f'\n[1] Building inter-CPA negative anchor ({N_INTER_PAIRS} pairs)...')
-    sample = load_feature_vectors_sample(n=3000)
-    inter_cos = build_inter_cpa_negative(sample, n_pairs=N_INTER_PAIRS)
+    print(f'\n[1] Building inter-CPA negative anchor ({N_INTER_PAIRS} '
+          f'i.i.d. pairs from full matched corpus)...')
+    pool_sig_ids, pool_accts = load_signature_ids_for_negative_pool()
+    print(f'    pool size: {len(pool_sig_ids):,} matched signatures')
+    inter_cos = build_inter_cpa_negative(pool_sig_ids, pool_accts,
+                                         n_pairs=N_INTER_PAIRS)
     print(f'    inter-CPA cos: mean={inter_cos.mean():.4f}, '
           f'p95={np.percentile(inter_cos, 95):.4f}, '
           f'p99={np.percentile(inter_cos, 99):.4f}, '
29_firm_a_yearly_distribution.py (new file):
@@ -0,0 +1,123 @@
+#!/usr/bin/env python3
+"""
+Script 29: Firm A Per-Year Cosine Distribution (Table XIII)
+============================================================
+Generates the year-by-year Firm A per-signature best-match cosine
+distribution reported as Table XIII in the manuscript. Gemini 3.1 Pro
+round-19 review identified that this table previously had no dedicated
+generating script (Appendix B incorrectly attributed it to
+`13_deloitte_distribution_analysis.py`, which has no year_month
+extraction).
+
+Definition:
+    Firm A membership is via CPA registry (accountants.firm joined on
+    signatures.assigned_accountant), matching the convention used by
+    scripts 24 and 28.
+
+For each fiscal year (substr(year_month, 1, 4)):
+    - N signatures with non-null max_similarity_to_same_accountant
+    - mean of max_similarity_to_same_accountant (the per-signature
+      best-match cosine)
+    - share with max_similarity_to_same_accountant < 0.95 (the
+      left-tail rate cited in Section IV-G.1)
+
+Output:
+    reports/firm_a_yearly/firm_a_yearly_distribution.json
+    reports/firm_a_yearly/firm_a_yearly_distribution.md
+"""
+
+import json
+import sqlite3
+from datetime import datetime
+from pathlib import Path
+
+DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
+OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
+           'firm_a_yearly')
+OUT.mkdir(parents=True, exist_ok=True)
+
+FIRM_A = '勤業眾信聯合'
+
+
+def yearly_distribution(conn):
+    cur = conn.cursor()
+    cur.execute("""
+        SELECT substr(s.year_month, 1, 4) AS year,
+               COUNT(*) AS n_sigs,
+               AVG(s.max_similarity_to_same_accountant) AS mean_cos,
+               SUM(CASE
+                       WHEN s.max_similarity_to_same_accountant < 0.95
+                       THEN 1 ELSE 0
+                   END) AS n_below_095
+        FROM signatures s
+        JOIN accountants a ON s.assigned_accountant = a.name
+        WHERE a.firm = ?
+          AND s.max_similarity_to_same_accountant IS NOT NULL
+          AND s.year_month IS NOT NULL
+        GROUP BY year
+        ORDER BY year
+    """, (FIRM_A,))
+
+    rows = []
+    for year, n_sigs, mean_cos, n_below in cur.fetchall():
+        rows.append({
+            'year': int(year),
+            'n_signatures': n_sigs,
+            'mean_best_match_cosine': round(mean_cos, 4),
+            'n_below_cosine_095': n_below,
+            'pct_below_cosine_095': round(100.0 * n_below / n_sigs, 2),
+        })
+    return rows
+
+
+def write_markdown(payload, path):
+    rows = payload['yearly_rows']
+    lines = []
+    lines.append('# Firm A Per-Year Cosine Distribution (Table XIII)')
+    lines.append('')
+    lines.append(f"Generated at: {payload['generated_at']}")
+    lines.append('')
+    lines.append('Firm A membership: CPA registry '
+                 '(accountants.firm = "勤業眾信聯合"). Per-signature '
+                 'best-match cosine = '
+                 'signatures.max_similarity_to_same_accountant.')
+    lines.append('')
+    lines.append('| Year | N sigs | mean best-match cosine | % below 0.95 |')
+    lines.append('|------|--------|------------------------|--------------|')
+    for r in rows:
+        lines.append(
+            f"| {r['year']} | {r['n_signatures']:,} | "
+            f"{r['mean_best_match_cosine']:.4f} | "
+            f"{r['pct_below_cosine_095']:.2f}% |"
+        )
+    path.write_text('\n'.join(lines) + '\n', encoding='utf-8')
+
+
+def main():
+    conn = sqlite3.connect(DB)
+    try:
+        payload = {
+            'generated_at': datetime.now().isoformat(timespec='seconds'),
+            'database_path': DB,
+            'firm_a_label': FIRM_A,
+            'firm_a_membership_definition': (
+                'CPA registry: accountants.firm joined on '
+                'signatures.assigned_accountant'
+            ),
+            'cosine_metric': 'signatures.max_similarity_to_same_accountant',
+            'yearly_rows': yearly_distribution(conn),
+        }
+    finally:
+        conn.close()
+
+    json_path = OUT / 'firm_a_yearly_distribution.json'
+    json_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False),
+                         encoding='utf-8')
+    print(f'Wrote {json_path}')
+
+    md_path = OUT / 'firm_a_yearly_distribution.md'
+    write_markdown(payload, md_path)
+    print(f'Wrote {md_path}')
+
+
+if __name__ == '__main__':
+    main()
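The new script's join-and-group logic can be exercised end-to-end on a toy in-memory database (hypothetical rows and firm names, not the paper's corpus): only Firm A signatures survive the join, and each fiscal year reports its count and its below-0.95 tally.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE accountants (name TEXT, firm TEXT);
    CREATE TABLE signatures (assigned_accountant TEXT, year_month TEXT,
                             max_similarity_to_same_accountant REAL);
    INSERT INTO accountants VALUES ('cpa1', 'FirmA'), ('cpa2', 'FirmB');
    INSERT INTO signatures VALUES
        ('cpa1', '2019-03', 0.99),
        ('cpa1', '2019-09', 0.93),   -- below the 0.95 cut
        ('cpa1', '2020-03', 0.97),
        ('cpa2', '2019-03', 0.50);   -- other firm: must be excluded
""")
rows = conn.execute("""
    SELECT substr(s.year_month, 1, 4) AS year,
           COUNT(*) AS n_sigs,
           SUM(CASE WHEN s.max_similarity_to_same_accountant < 0.95
                    THEN 1 ELSE 0 END) AS n_below_095
    FROM signatures s
    JOIN accountants a ON s.assigned_accountant = a.name
    WHERE a.firm = ?
    GROUP BY year ORDER BY year
""", ('FirmA',)).fetchall()
print(rows)  # → [('2019', 2, 1), ('2020', 1, 0)]
```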