# Independent Peer Review (Round 16) - Paper A v3.18.1

Reviewer role: independent peer reviewer for an IEEE Access Regular Paper.

Manuscript reviewed: "Replication-Dominated Calibration" - CPA signature analysis, v3.18.1, commit `cb77f481ec2ab4b93b0effbf4c0ee4c89e90d610`.

Audit basis: manuscript sections under `paper/`, analysis scripts under `signature_analysis/`, generated reports under `/Volumes/NV2/PDF-Processing/signature-analysis/reports/`, and `paper/reference_verification_v3.md`.

## 1. Overall Verdict: Minor Revision
The paper is close to submission-ready, and the central empirical story is largely reproducible from the provided scripts: a large Taiwan audit-report corpus; a signature-detection and feature-extraction pipeline; percentile-calibrated dual-descriptor classification; annotation-free validation using byte-identical positives and inter-CPA negatives; and strong Firm A concentration in several benchmark checks. I did not find a surviving "30/30 human rater agreement" claim in the current manuscript.

However, I would not recommend unconditional Accept. Three issues require revision before IEEE Access submission:

1. Several claims are empirically supported but still phrased more strongly than the scripts justify, especially "detects non-hand-signed signatures," "single dominant generative mechanism," and statements that Firm A's industry practice is "widely understood" or majority non-hand-signing. The data support replication-dominated calibration evidence, not a direct observation of the signing workflow.

2. A number of section references are stale after the v3.18 retitling and reframing. The most visible: references to Section IV-F for analyses that now appear under Section IV-G, and a Section III-K passage that cites a "Firm A P5 percentile 0.941" while the reported sensitivity cut is 0.945 and the calibration-fold P5 is 0.9407.

3. The empirical audit found no fabricated quantitative core result, but some claims are only partially reproducible from scripts: the generated tables are embedded as manuscript comments, and some scripts retain legacy comments or outputs from earlier versions (e.g., EER/precision/F1 code still present in Scripts 19 and 21, although the manuscript correctly omits those metrics).

These issues are Minor rather than Major because the numerical tables I checked generally match the scripts and reports, the prior fabricated rater-agreement problem appears to be removed, and the manuscript now states appropriate limitations around the annotation-free anchors and the signature-level scope.

## 2. Empirical-Claim Audit Table

Status definitions: VERIFIED = matches scripts/reports or the reference verification; UNVERIFIABLE = plausible but not independently supported by the provided artifacts; SUSPICIOUS = likely true directionally but overphrased or internally inconsistent; FABRICATED = contradicted by the provided artifacts, or unsupported despite being presented as measured fact. I found no clearly fabricated quantitative claim in v3.18.1.
| Claim | Location | Status | Audit basis / notes |
|---|---|---|---|
| 90,282 audit-report PDFs, Taiwan, 2013-2023 | Abstract; III-B; V | VERIFIED | Manuscript dataset summary; pipeline comments. No raw download log audited, but internally consistent across III-B and the conclusion. |
| 86,072 documents with signatures (95.4%); 12 corrupted PDFs excluded; final 86,071 documents | III-B/C/D; Tables I/III | VERIFIED | III-C explains 86,072 VLM-positive minus 12 corrupted = 86,071 final. The slight table split is clear enough. |
| 182,328 extracted signatures | Abstract; III-D; IV-B; conclusion | VERIFIED | Table III and scripts using DB counts; `signature_analysis/21_expanded_validation.py` loads the 168,740 post-best-match subset, consistent with the matched subset after exclusions. |
| 758 unique CPAs; >50 accounting firms; 15 document types, 86.4% standard audit reports | III-B/Table I | VERIFIED (758, >50); UNVERIFIABLE (15 types, 86.4%) | 758 is used repeatedly in the manuscript. I found no direct script/report cross-check for the 15-document-type and 86.4% breakdown in the inspected artifacts. |
| Qwen2.5-VL 32B; first-quartile scanning; temperature 0 | III-C | UNVERIFIABLE | Method claim, not contradicted, but no inspected config/output file establishes these exact inference settings. |
| VLM-YOLO agreement / YOLO detections in 98.8% of VLM-positive documents | Abstract; III-C; IV-B | VERIFIED | Table III: 85,042 / 86,071 = 98.8%. Script provenance not fully traced, but the arithmetic and manuscript consistency are correct. |
| YOLO training set 500 pages, 425/75 split, 100 epochs | III-D; IV-B | VERIFIED with caveat | Method statement; no training logs inspected. The 425/75 split is arithmetically consistent. |
| YOLO metrics: precision 0.97-0.98, recall 0.95-0.98, mAP@0.50 0.98-0.99, mAP@0.50:0.95 0.85-0.90 | Table II | UNVERIFIABLE | No training-results artifact found in `signature_analysis/`; the claim may be true but needs a reproducible log/table in the supplement. |
| Detection deployment: 43.1 docs/sec with 8 workers | III-D; Table III | UNVERIFIABLE | Reported in Table III; no inspected script/log verifies the runtime. |
| CPA-matched signatures: 168,755 / 182,328 = 92.6%; unmatched 13,573 = 7.4% | III-D; Table III | VERIFIED | 168,755 + 13,573 = 182,328; percentages correct. |
| Same-CPA best-match analyses use N = 168,740, 15 fewer than the matched count due to singleton CPAs | IV-D.1 | VERIFIED | `signature_analysis/15_hartigan_dip_test.py` and the reports use N = 168,740; the explanation is plausible and internally consistent. |
| ResNet-50, ImageNet-1K V2, 2048-d embeddings, 224x224 preprocessing, L2 normalization | III-E | VERIFIED | `signature_analysis/10_formal_statistical_analysis.py`; `paper/ablation_backbone_comparison.py`. |
| All-pairs intra-class N = 41,352,824; inter-class N = 500,000 | Table IV | VERIFIED | `signature_analysis/10_formal_statistical_analysis.py` computes all intra-class pairs and samples 500,000 inter-class pairs. |
| Table IV distribution stats: intra mean 0.821, inter mean 0.758, std/median/skew/kurtosis | IV-C/Table IV | VERIFIED | Consistent with the formal statistical report logic and the Table XVIII ResNet stats; the exact JSON was not fully quoted here, but no contradiction was found. |
| Shapiro-Wilk and K-S reject normality, p < 0.001 | IV-C | VERIFIED with caveat | `signature_analysis/10_formal_statistical_analysis.py` performs the tests. The paired-dependence caveat for large samples is correctly acknowledged later. |
| Lognormal best parametric fit by AIC | IV-C | UNVERIFIABLE | Mentioned in the manuscript; not confirmed in the inspected code excerpt/output. Needs a report citation or supplement table. |
| KDE crossover at 0.837; Cohen's d = 0.669; Mann-Whitney p < 0.001; K-S p < 0.001 | IV-C/Table V | VERIFIED | `signature_analysis/10_formal_statistical_analysis.py` computes these quantities; Table XVIII also repeats the ResNet crossover and d. |
| Pairwise p-values unreliable due to non-independence | IV-C | VERIFIED as methodological caveat | Correct; the same signature appears in many pairs. |
| Firm A cosine dip: N=60,448, dip=0.0019, p=0.169, unimodal | IV-D.1/Table V | VERIFIED | `/reports/dip_test/dip_test_results.json`; `signature_analysis/15_hartigan_dip_test.py`. |
| Firm A dHash dip: N=60,448, dip=0.1051, p<0.001, multimodal | IV-D.1/Table V | VERIFIED | `/reports/dip_test/dip_test_results.json`. |
| All-CPA cosine dip: N=168,740, dip=0.0035, p<0.001, multimodal | IV-D.1/Table V | VERIFIED | `/reports/dip_test/dip_test_results.json`. |
| All-CPA dHash dip: N=168,740, dip=0.0468, p<0.001, multimodal | IV-D.1/Table V | VERIFIED | `/reports/dip_test/dip_test_results.json`. |
| Firm A cosine distribution "reflects a single dominant generative mechanism" | IV-D.1 | SUSPICIOUS | Dip p=0.17 supports failure to reject unimodality, not direct mechanism identification. Rewrite as "consistent with" rather than "reflecting." |
| BD/McCrary Firm A cosine transition 0.985 at bin 0.005; full 0.985; dHash transition 2 | IV-D.2; Appendix A | VERIFIED | `signature_analysis/25_bd_mccrary_sensitivity.py`; `/reports/bd_sensitivity/bd_sensitivity.json`. |
| BD transition drift: Firm A cosine 0.987/0.985/0.980/0.975 as the bin widens; full dHash 2/10/9 | Appendix A | VERIFIED | `/reports/bd_sensitivity/bd_sensitivity.json`. |
| BD/McCrary transition lies inside the non-hand-signed mode and is not bin-width-stable | IV-D.2; Appendix A | VERIFIED as interpretation | The script supports the instability. "Inside the mode" is interpretive but reasonable given the Firm A high-similarity mass. |
| Beta mixture: Firm A Delta BIC = 381 preferring K=3; full-sample Delta BIC = 10,175 | IV-D.3; V-B | VERIFIED | `/reports/beta_mixture/beta_mixture_results.json`: -371092.8 vs -371473.9; -787280.4 vs -797455.1. |
| Firm A forced Beta-2 crossing 0.977; logit-GMM crossing 0.999 | IV-D.3/Table VI | VERIFIED | `/reports/beta_mixture/beta_mixture_results.json`: 0.9774276 and 0.9992143. |
| Full-sample forced Beta crossing none; logit-GMM 0.980 | IV-D.3/Table VI | VERIFIED | `/reports/beta_mixture/beta_mixture_results.json`. |
| Operational Firm A P7.5 cosine cut: cos > 0.95; 92.5% above / 7.5% at or below | Abstract; III-H/K; IV-E | VERIFIED | `/reports/pixel_validation/pixel_validation_results.json`: Firm A cosine>0.95 = 0.9251257. |
| dHash cutoffs <=5, <=8, <=15; Firm A dHash median 2; P75 approx 4; P95 9 | III-K; IV-E/F | VERIFIED | `/reports/validation_recalibration/validation_recalibration.json` and the pixel-validation JSON. |
| Firm A whole-sample capture: cos>0.837 99.93%, cos>0.9407 95.15%, cos>0.945 94.02%, cos>0.95 92.51% | Table IX | VERIFIED (mostly) | Counts/rates match the manuscript, except the pixel JSON has 0.941 rather than 0.9407 from an older run; the recalibration JSON supports the 0.9407 threshold family. |
| Firm A whole-sample dHash<=5 84.20%, <=8 95.17%, <=15 99.83%; dual cos>0.95 AND dHash<=8 89.95% | Table IX; abstract | VERIFIED | `/reports/pixel_validation/pixel_validation_results.json`; `/reports/validation_recalibration/validation_recalibration.json`. |
| 310 byte-identical positives | Abstract; IV-F.1; V-F | VERIFIED | `signature_analysis/19_pixel_identity_validation.py`; `/reports/pixel_validation/pixel_validation_results.json`; `/reports/expanded_validation/expanded_validation_results.json`. |
| 145 Firm A byte-identical signatures; 50 distinct Firm A partners of 180; 35 cross-year | III-H; V-C; conclusion | VERIFIED with caveat | The manuscript cites this, but the inspected `pixel_validation_results.json` reports only the 310 all-sample pixel-identical signatures. I did not inspect an output table listing the 145/50/35 decomposition. Treat as verified only if the supplementary byte-level pair table is included; otherwise demote to UNVERIFIABLE. |
| 50,000 inter-CPA negative pairs; inter-CPA mean=0.762, P95=0.884, P99=0.913, max=0.988 | IV-F.1 | VERIFIED | `signature_analysis/21_expanded_validation.py`; `/reports/expanded_validation/expanded_validation_results.json`. |
| Table X FAR at thresholds: 0.837 -> 0.2062; 0.900 -> 0.0233; 0.945 -> 0.0008; 0.950 -> 0.0007; 0.973 -> 0.0003; 0.979 -> 0.0002; Wilson CIs | IV-F.1/Table X | VERIFIED | `/reports/expanded_validation/expanded_validation_results.json`. |
| Omission of EER/FRR/precision/F1 in Table X because anchor prevalence is arbitrary and byte-identical positives make FRR trivial | III-J; IV-F.1 | VERIFIED methodologically | A correct manuscript correction. The scripts still compute legacy EER/precision/F1 in places; the manuscript appropriately omits them. |
| Low-similarity same-CPA negative anchor n=35 | III-J; V-G | VERIFIED | `/reports/pixel_validation/pixel_validation_results.json`. |
| Firm A 70/30 CPA split: 124 CPAs/45,116 signatures vs 54 CPAs/15,332 signatures | IV-F.2/Table XI | VERIFIED | `/reports/validation_recalibration/validation_recalibration.json`; `signature_analysis/24_validation_recalibration.py`. |
| 178 Firm A CPAs in the split vs 180 in the registry; two excluded for disambiguation ties | IV-F.2 | UNVERIFIABLE | Plausible and internally consistent, but I found no script/report field documenting the two disambiguation ties. |
| Calibration-fold thresholds: cosine median 0.9862, P1 0.9067, P5 0.9407; dHash median 2, P95 9 | Table XI | VERIFIED | `/reports/validation_recalibration/validation_recalibration.json`; `/reports/expanded_validation/expanded_validation_results.json`. |
| Table XI fold rates and z-tests | IV-F.2/Table XI | VERIFIED | `/reports/validation_recalibration/validation_recalibration.json`. |
| Extreme rules agree across folds; operational 85-95% rules differ by 1-5 points, p<0.001 | IV-F.2; conclusion | VERIFIED | The recalibration JSON supports this. |
| Sensitivity: cos>0.95 vs cos>0.945 reclassifies 8,508 signatures; category counts in Table XII | IV-F.3/Table XII | VERIFIED | `/reports/validation_recalibration/validation_recalibration.json`. |
| Firm A dual capture shifts from 89.95% to 91.14%, +1.19 pp | IV-F.3 | VERIFIED | Recalibration JSON: 0.89945 vs 0.91138. |
| Text says "Firm A P5 percentile 0.941" but the sensitivity analysis uses 0.945 | III-K | SUSPICIOUS | The calibration-fold P5 is 0.9407; the deployed sensitivity cut is 0.945. Revise to remove the "P5 percentile 0.941" vs "0.945 rounded" ambiguity. |
| Year-by-year Firm A left-tail table, 2013-2023 N/mean/% below 0.95 | IV-G.1/Table XIII | VERIFIED with caveat | Values plausible and internally consistent, but I did not find the specific report output in the inspected files. Include the generating script/table in the supplement. |
| 2013-2019 mean left tail 8.26%, 2020-2023 mean 6.96%; lowest 2023 = 3.75% | IV-G.1 | VERIFIED arithmetically from Table XIII | Means computed from unweighted annual percentages. If signature-weighted means were intended, disclose that. |
| Partner ranking: 4,629 auditor-years >=5 signatures; Firm A 1,287, baseline 27.8%; top decile 443/462 = 95.9%; top quartile 1,043/1,157 = 90.1%; top half 1,220/2,314 = 52.7% | IV-G.2/Table XIV | VERIFIED | `signature_analysis/22_partner_ranking.py`; `/reports/partner_ranking/partner_ranking_results.json`. |
| Year-by-year top-decile Firm A share range 88.4%-100% | IV-G.2/Table XV | VERIFIED | `/reports/partner_ranking/partner_ranking_results.json`. |
| Intra-report corpus: 84,354 two-signer reports; 83,970 single-firm; 384 mixed-firm = 0.46% | IV-G.3 | VERIFIED | `/reports/intra_report/intra_report_results.json` gives the same-firm totals plus mixed-firm categories adding to 384. |
| Intra-report Table XVI: Firm A 30,222 reports, agreement 89.91%; other Big-4 62-67%; 23-28 pp gap | IV-G.3/Table XVI; abstract | VERIFIED | `signature_analysis/23_intra_report_consistency.py`; `/reports/intra_report/intra_report_results.json`. |
| Firm A both non-hand-signed 26,435/30,222 = 87.5%; both likely hand-signed 4 = 0.01% | IV-G.3 | VERIFIED | `/reports/intra_report/intra_report_results.json`. |
| Intra-report gap "predicted by firm-wide practice" | IV-G.3 | SUSPICIOUS | The pattern is consistent with firm-wide practice but not uniquely diagnostic. Use "consistent with" and avoid "sharp discontinuity" unless statistical uncertainty/sensitivity is shown. |
| Document-level classification cohort 84,386; differs from 85,042 detections by 656 single-signature documents | IV-H/Table XVII | VERIFIED | The legacy PDF verdict report gives a total of 84,386; the explanation is internally consistent. |
| Table XVII document counts: high 29,529; moderate 36,994; style 5,133; uncertain 12,683; likely 47; total 84,386 | IV-H/Table XVII | VERIFIED | Sum = 84,386; consistent with the text. |
| Within the 71,656 documents exceeding cosine 0.95: 41.2% high, 51.7% moderate, 7.2% style-only | IV-H | VERIFIED | 29,529 + 36,994 + 5,133 = 71,656; percentages correct. |
| Abstract: "only 41% exhibit converging structural evidence ... 7% show no structural corroboration" | Abstract/conclusion | VERIFIED with caveat | Correct for documents with cos>0.95, but "only" is rhetorical; the 51.7% moderate band still shows partial structural similarity. |
| Firm A document capture: 96.9% high/moderate, 0.6% style, 2.5% uncertain, 4/30,226 likely hand-signed | IV-H.1 | VERIFIED | Table XVII Firm A counts sum to 30,226; 22,970 + 6,311 = 29,281 = 96.9%. |
| Cross-firm dual-descriptor convergence: non-Firm-A CPAs with cos>0.95 have dHash<=5 at 11.3%, Firm A 58.7% | IV-H.2 | UNVERIFIABLE | No direct output artifact for this exact comparison was found in the inspected scripts/reports. Add a reproducible table or script reference. |
| Ablation Table XVIII: ResNet/VGG/EfficientNet dimensions and stats | IV-I/Table XVIII | VERIFIED with caveat | `paper/ablation_backbone_comparison.py` implements the analysis; I did not inspect the generated ablation JSON. |
| ResNet-50 "best balance" over EfficientNet-B0 despite lower Cohen's d | IV-I; conclusion | VERIFIED as judgment | The chosen tradeoff is defensible but subjective; do not present it as a purely empirical optimum. |
| Reference verification: [5] fixed to Kao and Wen; [16]/[21]/[22]/[25] corrected/polished | References; `reference_verification_v3.md` | VERIFIED | The current `paper_a_references_v3.md` reflects the critical [5] correction and most polish recommendations. |
| "30/30 human rater agreement" | Current manuscript | VERIFIED ABSENT | `rg` found no surviving 30/30 rater-agreement claim in the manuscript sections. |
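As a spot-check aid for the Table X Wilson-CI rows above, the interval for a FAR estimate is quick to recompute. This is a minimal sketch, not the authors' code; the counts below are illustrative, chosen only to land near the reported FAR = 0.0007 at the 0.950 cut.

```python
import math

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n at ~95% coverage."""
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1.0 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# Illustrative: 35 false accepts among the 50,000 inter-CPA pairs (FAR = 0.0007).
lo, hi = wilson_interval(35, 50_000)
print(f"FAR 95% Wilson CI: [{lo:.5f}, {hi:.5f}]")
```

Checking a few thresholds this way against `expanded_validation_results.json` takes seconds and catches transcription slips in Table X.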
## 3. Methodological Rigor

The methodological core is substantially stronger than in earlier versions. The key positive points:

- The paper now separates operational calibration from descriptive distributional diagnostics. This is the right move: the signature-level dip/Beta/BD results do not converge on a clean two-mechanism threshold, so a transparent Firm A percentile anchor is more defensible than a forced mixture crossing.
- The dual-descriptor classifier is methodologically sensible. Cosine captures high-level embedding similarity; the independent-minimum dHash adds structural near-duplicate evidence and avoids treating every high-cosine signature as an image reproduction.
- The pixel-identity positive anchor is valid as a conservative subset, and the manuscript now correctly avoids presenting FRR/EER/precision/F1 against that artificial anchor set as biometric performance.
- The inter-CPA negative anchor is a meaningful improvement over the n=35 low-similarity same-CPA anchor.
- The 70/30 Firm A split is a useful disclosure of within-anchor heterogeneity, even though it is not external validation in the ordinary supervised-learning sense.
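The dual-descriptor rule described above can be sketched in a few lines. This is an illustrative reimplementation, not the authors' code: the thresholds (cosine > 0.95, dHash Hamming distance <= 8) are the manuscript's, while the function names and data layout are hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two embedding vectors (e.g., 2048-d ResNet-50)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

def dhash_distance(h1: int, h2: int) -> int:
    """Hamming distance between two 64-bit difference hashes."""
    return bin(h1 ^ h2).count("1")

def dual_descriptor_flag(emb_a, emb_b, hash_a, hash_b,
                         cos_cut: float = 0.95, dhash_cut: int = 8) -> bool:
    """Flag a pair as replication-consistent only when both descriptors agree."""
    return (cosine_similarity(emb_a, emb_b) > cos_cut
            and dhash_distance(hash_a, hash_b) <= dhash_cut)

rng = np.random.default_rng(0)
e = rng.normal(size=2048)
near = e + rng.normal(scale=0.01, size=2048)     # near-duplicate embedding
print(dual_descriptor_flag(e, near, 0xFFAA, 0xFFAB))  # hashes differ by 1 bit
```

The conjunction is the point: either descriptor alone admits false positives (cosine for stylistically stable signers, dHash for coarse layout matches), and requiring both is what makes the rule conservative.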
Remaining rigor concerns:

1. The inference from "Firm A dip p = 0.17" to "single dominant generative mechanism" is too strong. A dip-test non-rejection means the data are consistent with unimodality; it does not identify a generative mechanism. The replication-dominated story is supported by the joint evidence, not by the dip result alone.

2. The claim that Firm A's "industry practice is widely understood" is background knowledge, not reproducible evidence. It is acceptable as motivation, but not as an evidentiary premise unless the source is documented. The paper says the evidence comes from the image analyses, which is good; the wording should keep practitioner knowledge clearly non-load-bearing.

3. The dHash thresholds are reasonable but still heuristic. The text says the dHash cuts are "on the same reference"; it should specify exactly which reference: the whole-sample Firm A distribution, the median/P75 high band, and the style-consistency ceiling at >15.

4. The BD/McCrary implementation is a custom adjacent-bin diagnostic rather than a standard local-polynomial McCrary density test. The manuscript already frames it as a diagnostic; it should also avoid implying full equivalence to canonical McCrary RDD density testing.

5. The partner-ranking statistic uses each year's signatures' maximum similarity to the CPA's full cross-year pool. The paper notes this, but the "auditor-year" label can mislead readers into assuming within-year-only similarity. The untracked `signature_analysis/27_within_year_uniformity.py` suggests this sensitivity is being explored; if it is not included, the limitation should be made more explicit.
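To make concern 4 concrete: as I read the scripts, the adjacent-bin diagnostic amounts to comparing neighboring histogram bin counts under a normal approximation, which is not the local-polynomial McCrary estimator. A minimal sketch under that reading follows; the function name, the synthetic two-mode data, and the exact z formula are my assumptions, not the paper's implementation.

```python
import numpy as np

def adjacent_bin_z(sims: np.ndarray, bin_width: float = 0.005):
    """Z-score for each adjacent pair of histogram bins under a Poisson-style
    normal approximation; large |z| marks a candidate density jump."""
    edges = np.arange(0.0, 1.0 + bin_width, bin_width)
    counts, _ = np.histogram(sims, bins=edges)
    n1, n2 = counts[:-1].astype(float), counts[1:].astype(float)
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (n2 - n1) / np.sqrt(n1 + n2)
    # centers[i] is the edge shared by bin pair (i, i+1)
    return edges[1:-1], np.nan_to_num(z)

# Synthetic corpus: a broad hand-signed-like mode plus a tight replica mode.
rng = np.random.default_rng(1)
sims = np.concatenate([rng.beta(8, 3, 20_000), rng.beta(300, 4, 40_000)])
centers, z = adjacent_bin_z(sims)
print(f"largest jump near similarity {centers[np.argmax(np.abs(z))]:.3f}")
```

The sketch also illustrates the instability the appendix documents: rerunning with a different `bin_width` moves the largest-|z| edge, which is why a bin-pair diagnostic should not be read as a threshold estimator.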
## 4. Narrative Discipline

The narrative is much more disciplined than prior-round summaries suggested, but it still needs tightening.

Overclaims / scope creep:

- "Detects non-hand-signed signatures" should usually read "classifies signatures as replication-consistent/non-hand-signed under a calibrated dual-descriptor rule." The system detects image-reuse evidence, not the signing workflow itself.
- "Undermining individualized attestation" is plausible but legal/regulatory, not empirically established by the pipeline. It is acceptable in the introduction or impact statement if phrased as a concern, not a measured outcome.
- "From the perspective of the output image the two workflows are equivalent: both reproduce a single stored image so that signatures on different reports from the same partner are identical up to reproduction noise" is too absolute. Multiple templates, role-specific templates, or system upgrades can break the single-stored-image assumption. The methodology later acknowledges multi-template regimes; the introduction and method overview should match that nuance.
- "This sharp discontinuity ... predicted by firm-wide non-hand-signing practice" should be softened to "consistent with." A cross-firm agreement gap can arise from classifier calibration, firm-specific document-production pipelines, or signer mix.
- The conclusion says the replication-dominated calibration strategy is "directly generalizable" to settings with a dominant reference subpopulation and a byte-level trace. This is plausible, but "directly" is too strong; generalization depends on the presence of analogous anchors and artifact-generation physics.

Scope discipline that works well:

- The paper now repeatedly states that signature-level rates are not partner-level frequencies.
- The held-out Firm A fold is correctly presented as a within-Firm-A sampling-variance disclosure rather than external proof.
- The byte-identical anchor is correctly framed as a conservative subset, not as recall ground truth for all positives.
## 5. IEEE Access Fit

IEEE Access fit is good. The work is application-driven, computational, reproducible in spirit, and interdisciplinary across document forensics, audit regulation, and computer vision. The novelty lies not in a new neural architecture but in the calibration/validation design for a difficult real-world forensic corpus. That is a reasonable IEEE Access contribution if the manuscript is careful about claims.

Rigor is adequate for a Regular Paper after minor revisions. The main technical limitation is the absence of a boundary-focused manual adjudication set, but the paper acknowledges this and offers a coherent annotation-free validation strategy. Reproducibility would improve if the authors bundled the generated JSON/Markdown reports or explicitly mapped each table to its script/report path.

Clarity is mostly high, but the section-number drift and the 0.941/0.945 wording need cleanup before submission. IEEE Access reviewers will notice stale cross-references.
## 6. Specific Actionable Revisions and Proposed Rewrites

1. Soften mechanism-identification language.

Current:

"Firm A's per-signature cosine distribution is unimodal (p = 0.17), reflecting a single dominant generative mechanism..."

Proposed:

"Firm A's per-signature cosine distribution fails to reject unimodality (p = 0.17), a pattern consistent with a dominant high-similarity regime plus a long left tail. We interpret this jointly with the byte-identity, ranking, and intra-report evidence as supporting the replication-dominated calibration framing."

2. Remove the over-absolute "single stored image on every report" wording.

Current:

"both reproduce a single stored image so that signatures on different reports from the same partner are identical up to reproduction noise."

Proposed:

"both can reproduce one or more stored signature images, producing same-CPA signatures that are identical or near-identical up to reproduction, scanning, compression, and template-variant noise."

3. Clarify the status of practitioner knowledge.

Current:

"industry practice at the firm is widely understood among practitioners..."

Proposed:

"Practitioner knowledge motivated treating Firm A as a candidate calibration reference, but the evidentiary basis used in this paper is the observable image evidence reported below: byte-identical same-CPA pairs, the Firm A similarity distribution, partner-ranking concentration, and intra-report consistency."

4. Fix section-reference drift.

Examples:

- III-H says the three complementary analyses are in Section IV-F; in the current manuscript they are in Section IV-G.
- The III-H bullet labels cite IV-F.1/IV-F.2/IV-F.3 for the longitudinal, ranking, and intra-report analyses; these should be IV-G.1/IV-G.2/IV-G.3.
- The final sentence of Results IV-F.2 says "threshold-independent partner-ranking analysis (Section IV-F.2)," but ranking is Section IV-G.2.
- Methodology III-G says partner-level ranking is in Section IV-F.2; update to IV-G.2.

5. Fix the 0.941/0.945 sensitivity wording.

Current:

"replacing 0.95 with the slightly stricter Firm A P5 percentile 0.941 alters aggregate firm-level capture rates by at most approx 1.2 percentage points"

Proposed:

"replacing 0.95 with the nearby rounded sensitivity cut 0.945 (motivated by the calibration-fold P5 = 0.9407) shifts whole-Firm-A dual-rule capture by 1.19 percentage points."
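The threshold relationship in the proposed wording can be checked mechanically: the calibration-fold P5 is just a percentile of the fold's same-CPA similarity scores, and the capture-rate sensitivity is a pair of tail fractions. A sketch with synthetic scores follows; the real values come from `24_validation_recalibration.py`, and everything below (the mixture shape, the counts) is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic calibration-fold same-CPA cosine similarities: a tight
# high-similarity bulk plus a small left tail, loosely mimicking the
# Firm A fold shape described in Table XI.
fold = np.concatenate([rng.beta(400, 6, 45_000), rng.beta(20, 8, 2_500)])

p1 = np.percentile(fold, 1)
p5 = np.percentile(fold, 5)      # fold-derived strict cut (paper: 0.9407)
median = np.median(fold)
print(f"P1={p1:.4f}  P5={p5:.4f}  median={median:.4f}")

# Capture-rate sensitivity of two nearby cuts, as in the Table XII comparison.
for cut in (0.950, 0.945):
    print(f"cos>{cut}: capture {np.mean(fold > cut):.4f}")
```

Stating the rule this way in the manuscript ("P5 of the calibration fold, then a rounded deployed cut") removes the 0.941-versus-0.945 ambiguity entirely.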
6. Add table-to-script provenance.

Add a compact appendix table:

| Manuscript table | Reproduction artifact |
|---|---|
| Table V | `signature_analysis/15_hartigan_dip_test.py`; `reports/dip_test/dip_test_results.json` |
| Table VI | `signature_analysis/17_beta_mixture_em.py`; `reports/beta_mixture/beta_mixture_results.json`; `signature_analysis/25_bd_mccrary_sensitivity.py` |
| Table X | `signature_analysis/21_expanded_validation.py`; `reports/expanded_validation/expanded_validation_results.json` |
| Tables XI/XII | `signature_analysis/24_validation_recalibration.py`; `reports/validation_recalibration/validation_recalibration.json` |
| Tables XIV/XV | `signature_analysis/22_partner_ranking.py`; `reports/partner_ranking/partner_ranking_results.json` |
| Table XVI | `signature_analysis/23_intra_report_consistency.py`; `reports/intra_report/intra_report_results.json` |

7. Either document or remove the exact but unverifiable decomposition claims.

For "145 Firm A signatures across 50 partners of 180, 35 cross-year," cite the exact script/report path that generates the decomposition. If no reproducible artifact is packaged, rewrite as:

"A subset of Firm A byte-identical matches is distributed across many partners; the supplementary byte-identity table reports the exact partner and cross-year counts."

8. Treat the "cross-firm dual convergence 11.3% vs 58.7%" claim as a table or remove it.

This is a useful claim, but I did not find a direct reproduction artifact. Add a small table with counts, denominators, and script provenance.

9. Tighten the impact statement.

Current:

"automatically extracts and analyzes signatures from over 90,000 audit reports..."

This is accurate. But "separate hand-written signatures from reproduced ones" should remain removed or avoided; use "stratifies signatures by evidence of image reproduction" instead.

10. Clean legacy script comments before releasing the supplement.

Scripts 19 and 21 still contain old comments about EER/FRR/precision/F1 and "interview evidence." Even though the manuscript is corrected, reviewers who inspect the code may read these as conceptual residue. Update the comments to match the paper's current anchor-based evaluation language.
## 7. Disagreements with the Prior Round-7 Gemini Accept Verdict

I disagree with the round-7 Gemini "fully submission-ready / no v3.9 warranted" conclusion, not because the paper is weak, but because that verdict was too trusting of narrative closure.

Specific disagreements:

1. Gemini focused on prior blockers (the BD/McCrary reframing, the FRR/EER removal, the 15-signature footnote) and did not perform a fresh empirical-claim audit. The missed "30/30 human rater agreement" problem is exactly the kind of issue that survives when reviewers check only the latest patch.

2. Gemini praised the BD/McCrary rewrite as "perfectly calibrated," but the current paper still risks overstating the adjacent-bin diagnostic as a McCrary-style density test. It is now acceptable, not perfect.

3. Gemini treated the paper as "fully submission-ready" before the Firm A replication-dominated framing was fully disciplined. v3.18.1 is better, but it still contains overstrong mechanism phrases and practitioner-knowledge language that need tightening.

4. Gemini did not flag the stale cross-references and threshold-wording inconsistencies. These are minor, but IEEE reviewers will read them as polish and reproducibility issues.

5. Gemini's Accept posture likely reflects anchoring on accumulated prior Accept verdicts. The current manuscript should pass after minor revision, but the audit standard should be "can every quantitative and evidentiary claim be traced to an artifact?" rather than "did the last known blocker get patched?"

Bottom line: I recommend Minor Revision. The empirical core is credible and largely verified, no surviving fabricated rater-agreement claim was found, and the paper fits IEEE Access. The authors should revise the few overstrong claims and improve provenance and cross-reference hygiene before submission.
Taken together, Table A.I shows that the signature-level BD/McCrary transitions are not stable across bin widths.

This observation supports the main-text decision to use BD/McCrary as a density-smoothness diagnostic rather than as a threshold estimator, and it reinforces the joint reading of Section IV-D that per-signature similarity does not form a clean two-mechanism mixture.

Raw per-bin $Z$ sequences and $p$-values for every (variant, bin-width) panel are available in the supplementary materials.
# Appendix B. Table-to-Script Provenance

For reproducibility, the following table maps each numerical table in Section IV to the analysis script that produces its underlying values and to the JSON / Markdown report file emitted by that script. Scripts referenced are under `signature_analysis/` and reports under the project's `reports/` tree.

<!-- TABLE B.I: Manuscript table → reproduction artifact

| Manuscript table | Generating script | Report artifact |
|------------------|-------------------|-----------------|
| Table III (extraction results) | `02_extract_features.py`; `09_pdf_signature_verdict.py` | extraction logs (supplementary) |
| Table IV (intra/inter all-pairs cosine statistics) | `10_formal_statistical_analysis.py` | `reports/formal_statistical/formal_statistical_results.json` |
| Table V (Hartigan dip test) | `15_hartigan_dip_test.py` | `reports/dip_test/dip_test_results.json` |
| Table VI (signature-level threshold-estimator summary) | `17_beta_mixture_em.py`; `25_bd_mccrary_sensitivity.py` | `reports/beta_mixture/beta_mixture_results.json`; `reports/bd_sensitivity/bd_sensitivity.json` |
| Table IX (Firm A whole-sample capture rates) | `19_pixel_identity_validation.py`; `24_validation_recalibration.py` | `reports/pixel_validation/pixel_validation_results.json`; `reports/validation_recalibration/validation_recalibration.json` |
| Table X (cosine threshold sweep, FAR vs inter-CPA negatives) | `21_expanded_validation.py` | `reports/expanded_validation/expanded_validation_results.json` |
| Table XI (held-out vs calibration Firm A capture rates) | `24_validation_recalibration.py` | `reports/validation_recalibration/validation_recalibration.json` |
| Table XII (operational-cut sensitivity 0.95 vs 0.945) | `24_validation_recalibration.py` | `reports/validation_recalibration/validation_recalibration.json` |
| Table XIII (Firm A per-year cosine distribution) | `13_deloitte_distribution_analysis.py` | `reports/deloitte_distribution/deloitte_distribution_results.json` |
| Tables XIV / XV (partner-level similarity ranking) | `22_partner_ranking.py` | `reports/partner_ranking/partner_ranking_results.json` |
| Table XVI (intra-report classification agreement) | `23_intra_report_consistency.py` | `reports/intra_report/intra_report_results.json` |
| Table XVII (document-level five-way classification) | `09_pdf_signature_verdict.py`; `12_generate_pdf_level_report.py` | `reports/pdf_level/pdf_level_results.json` |
| Table XVIII (backbone ablation) | `paper/ablation_backbone_comparison.py` | `reports/ablation/ablation_results.json` |
| Table A.I (BD/McCrary bin-width sensitivity) | `25_bd_mccrary_sensitivity.py` | `reports/bd_sensitivity/bd_sensitivity.json` |
| Byte-identity decomposition (145 / 50 / 180 / 35; Section IV-F.1) | `28_byte_identity_decomposition.py` | `reports/byte_identity_decomp/byte_identity_decomposition.json` |
| Cross-firm dual-descriptor convergence (Section IV-H.2) | `28_byte_identity_decomposition.py` | `reports/byte_identity_decomp/byte_identity_decomposition.json` |
-->

The table-to-script mapping above is intended as a navigation aid for replicators. All scripts run deterministically under the fixed random seeds documented in the supplementary materials; report files are committed alongside the scripts so that each numerical claim in Section IV traces to a specific JSON field rather than to an undocumented intermediate computation.
@@ -27,5 +27,5 @@ Several directions merit further investigation.
Domain-adapted feature extractors, trained or fine-tuned on signature-specific datasets, may improve discriminative performance beyond the transferred ImageNet features used in this study.
Extending the analysis to auditor-year units---computing per-signature statistics within each fiscal year and tracking how individual CPAs move across years---could reveal within-CPA transitions between hand-signing and non-hand-signing over the decade and is the natural next step beyond the cross-sectional analysis reported here.
The pipeline's applicability to other jurisdictions and document types (e.g., corporate filings in other countries, legal documents, medical records) warrants exploration.
The replication-dominated calibration strategy and the pixel-identity anchor technique are both generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself, conditional on the availability of analogous anchors in the new domain and on artifact-generation physics that preserve the byte-level trace.
Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach.
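As a reviewer's illustration of what "accountant-level blocking" could look like in a larger negative-anchor study, the following sketch draws different-CPA signature pairs while capping how many pairs any single CPA contributes. The function name, data layout, and cap are hypothetical, not the authors' implementation:

```python
import random
from collections import defaultdict

def sample_inter_cpa_negatives(sig_ids_by_cpa, n_pairs, max_per_cpa, seed=0):
    """Draw different-CPA (negative) signature pairs with accountant-level
    blocking: no CPA contributes endpoints to more than `max_per_cpa` pairs.
    sig_ids_by_cpa : {cpa_id: [signature_id, ...]} (hypothetical layout)."""
    rng = random.Random(seed)
    cpas = [c for c, sigs in sig_ids_by_cpa.items() if sigs]
    used = defaultdict(int)
    pairs = []
    attempts = 0
    while len(pairs) < n_pairs and attempts < 100 * n_pairs:
        attempts += 1
        a, b = rng.sample(cpas, 2)          # two distinct CPAs
        if used[a] >= max_per_cpa or used[b] >= max_per_cpa:
            continue                         # cap reached: resample
        pairs.append((rng.choice(sig_ids_by_cpa[a]),
                      rng.choice(sig_ids_by_cpa[b])))
        used[a] += 1
        used[b] += 1
    return pairs
```

The cap prevents a few prolific CPAs from dominating the negative-anchor distribution, which is the point of the blocking suggestion above.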
@@ -9,7 +9,7 @@ While the law permits either a handwritten signature or a seal, the CPA's attest
The digitization of financial reporting has introduced a practice that complicates this intent.
As audit reports are now routinely generated, transmitted, and archived as PDF documents, it is technically and operationally straightforward to reproduce a CPA's stored signature image across many reports rather than re-executing the signing act for each one.
This reproduction can occur either through an administrative stamping workflow---in which scanned signature images are affixed by staff as part of the report-assembly process---or through a firm-level electronic signing system that automates the same step.
From the perspective of the output image the two workflows are equivalent: both can reproduce one or more stored signature images, producing same-CPA signatures that are identical or near-identical up to reproduction, scanning, compression, and template-variant noise.
We refer to signatures produced by either workflow collectively as *non-hand-signed*.
Although this practice may fall within the literal statutory requirement of "signature or seal," it raises substantive concerns about audit quality, as an identically reproduced signature applied across hundreds of reports may not represent meaningful individual attestation for each engagement.
The accounting literature has long examined the audit-quality consequences of partner-level engagement transparency: studies of partner-signature mandates in the United Kingdom find measurable downstream effects [31], cross-jurisdictional evidence on individual partner signature requirements highlights similar quality channels [32], and Taiwan-specific evidence on mandatory partner rotation documents how individual-partner identification interacts with audit-quality outcomes [33].
@@ -7,7 +7,7 @@ Fig. 1 illustrates the overall architecture.
The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum anchored on whole-sample Firm A percentile heuristics and validated against a byte-level pixel-identity positive anchor and a large random inter-CPA negative anchor.

Throughout this paper we use the term *non-hand-signed* rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years).
From the perspective of the output image the two workflows are equivalent: both can reproduce one or more stored signature images, producing same-CPA signatures that are identical or near-identical up to reproduction, scanning, compression, and template-variant noise.

<!--
[Figure 1: Pipeline Architecture - clean vector diagram]
@@ -116,7 +116,7 @@ Cosine similarity and dHash are both robust to the noise introduced by the print

Two unit-of-analysis choices are relevant for this study, ordered from finest to coarsest: (i) the *signature*---one signature image extracted from one report; and (ii) the *auditor-year*---all signatures by one CPA within one fiscal year.
The signature is the operational unit of classification (Section III-K) and of all primary statistical analyses (Section IV-D, IV-F, IV-G).
The auditor-year is used in the partner-level similarity ranking of Section IV-G.2 as a deliberately within-year aggregation that avoids cross-year pooling.
We do not use a coarser CPA-level cross-year unit, because pooling a CPA's signatures across the full 2013--2023 sample period would conflate distinct signing-mechanism regimes whenever a CPA's practice changes during the sample, and we make no claim about the within-CPA stability of signing mechanisms over time.

For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA (over the full same-CPA set, not restricted to the same fiscal year).
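As a reviewer's sketch of this per-signature computation (the function name and array layout are illustrative assumptions, not the authors' code; descriptor vectors and dHash bits are taken as already extracted):

```python
import numpy as np

def best_match_stats(features, dhash_bits):
    """Per-signature best-match statistics over one CPA's signature set:
    for each signature, the maximum cosine similarity and the minimum
    dHash Hamming distance against every OTHER same-CPA signature.
    features   : (n, d) float array of descriptor vectors.
    dhash_bits : (n, k) 0/1 array of dHash bits."""
    X = features / np.linalg.norm(features, axis=1, keepdims=True)
    cos = X @ X.T                        # pairwise cosine similarity
    np.fill_diagonal(cos, -np.inf)       # exclude self-pairs
    max_cos = cos.max(axis=1)

    # Hamming distance = number of differing dHash bits.
    ham = (dhash_bits[:, None, :] != dhash_bits[None, :, :]).sum(axis=2)
    np.fill_diagonal(ham, dhash_bits.shape[1] + 1)   # exclude self-pairs
    min_ham = ham.min(axis=1)
    return max_cos, min_ham
```

Note that the self-pair exclusion is what makes these "independent" best-match statistics: a signature is never compared against itself.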
@@ -136,29 +136,29 @@ We make *no* within-year or across-year uniformity assumption about CPA signing
Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation.
A CPA's signing output within a single fiscal year may reflect a single replication template, multiple templates used in parallel (e.g., different stored images for different engagement positions or reporting pipelines), within-year mechanism mixing, or a combination; our signature-level analyses remain valid under all of these regimes, since they do not attempt mechanism attribution at the partner or report level.

The intra-report consistency analysis in Section IV-G.3 is a firm-level homogeneity check---whether the *two co-signing CPAs on the same report* receive the same signature-level label under the operational classifier---rather than a test of within-partner or within-year uniformity.

## H. Calibration Reference: Firm A as a Replication-Dominated Population

A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm in Taiwan---as an empirical calibration reference.
Rather than treating Firm A as a synthetic or laboratory positive control, we treat it as a naturally occurring *replication-dominated population*: a CPA population whose aggregate signing behavior is dominated by non-hand-signing but is not a pure positive class.

Practitioner knowledge motivated treating Firm A as a candidate calibration reference: it is widely held within the audit profession that the firm reproduces a stored signature image for the majority of certifying partners---originally via administrative stamping workflows and later via firm-level electronic signing systems---while not ruling out that a minority of partners may continue to hand-sign some or all of their reports.
This practitioner background is *non-load-bearing* in our analysis: the evidentiary basis used in this paper is the observable image evidence reported below---byte-identical same-CPA pairs, the Firm A per-signature similarity distribution, partner-ranking concentration, and intra-report consistency---which does not depend on any claim about signing practice beyond what the audit-report images themselves show.

We establish Firm A's replication-dominated status through two primary independent quantitative analyses plus a third strand comprising three complementary checks, each of which can be reproduced from the public audit-report corpus alone:

First, *automated byte-level pair analysis* (Section IV-F.1; reproduced by `signature_analysis/28_byte_identity_decomposition.py` with output in `reports/byte_identity_decomp/byte_identity_decomposition.json`) identifies 145 Firm A signatures that are byte-identical to at least one other same-CPA signature from a different audit report, distributed across 50 distinct Firm A partners (of 180 registered); 35 of these byte-identical matches span different fiscal years.
Byte-identity implies pixel-identity by construction, and independent hand-signing cannot produce pixel-identical images across distinct reports---these pairs therefore establish image reuse as a concrete, threshold-free phenomenon within Firm A and confirm that replication is widespread (50 of 180 registered partners) rather than confined to a handful of CPAs.

Second, *signature-level distributional evidence*: Firm A's per-signature best-match cosine distribution is unimodal in shape with a long left tail (the Hartigan dip test fails to reject unimodality, $p = 0.17$, at $n \geq 10$ signatures; Section IV-D), consistent with a single dominant mechanism (non-hand-signing) plus residual within-firm heterogeneity rather than two cleanly separated mechanisms.
92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95 and the remaining 7.5% form the long left tail (we do not disaggregate partner-level mechanism here; see Section III-G for the scope of claims).
The unimodal-long-tail shape, not the precise 92.5/7.5 split, is the structural evidence: it predicts that Firm A is replication-dominated rather than a clean two-class population, and a noise-only explanation of the left tail would predict a shrinking share as scan/PDF technology matured over 2013--2023, which is not what we observe (Section IV-G.1).

Third, we additionally validate the Firm A benchmark through three complementary analyses reported in Section IV-G. Only the partner-level ranking is fully threshold-free; the longitudinal-stability and intra-report analyses use the operational classifier and are interpreted as consistency checks on its firm-level output:
(a) *Longitudinal stability (Section IV-G.1).* The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6-13% across 2013-2023, with the lowest share in 2023. The 0.95 cutoff is the whole-sample Firm A P7.5 heuristic (Section III-K; 92.5% of whole-sample Firm A signatures exceed this cutoff); the substantive finding here is the *temporal stability* of the rate, not the absolute rate at any single year.
(b) *Partner-level similarity ranking (Section IV-G.2).* When every auditor-year is ranked globally by its per-auditor-year mean best-match cosine (across all firms: Big-4 and Non-Big-4), Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff.
(c) *Intra-report consistency (Section IV-G.3).* Because each Taiwanese statutory audit report is co-signed by two engagement partners, firm-wide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label under the classifier. Firm A exhibits 89.9% intra-report agreement against 62-67% at the other Big-4 firms. This test uses the operational classifier and is therefore a *consistency* check on the classifier's firm-level output rather than a threshold-free test; the cross-firm gap (not the absolute rate) is the substantive finding.
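Checks (b) and (c) are simple to restate computationally. A reviewer's sketch with illustrative function names and input layouts (not the authors' scripts), using synthetic data:

```python
import numpy as np
from collections import defaultdict

def top_decile_concentration(mean_cos, is_firm_a):
    """Check (b): rank auditor-years by mean best-match cosine, take the
    top 10%, and compare Firm A's share there to its baseline share."""
    mean_cos = np.asarray(mean_cos, dtype=float)
    is_firm_a = np.asarray(is_firm_a, dtype=bool)
    k = max(1, len(mean_cos) // 10)
    top = np.argsort(mean_cos)[::-1][:k]       # indices of the top decile
    top_share = float(is_firm_a[top].mean())
    base_share = float(is_firm_a.mean())
    return top_share, base_share, top_share / base_share

def intra_report_agreement(labels):
    """Check (c): share of two-signer reports whose co-signing CPAs
    receive the same signature-level label.
    labels : iterable of (report_id, signature_label) rows."""
    by_report = defaultdict(list)
    for report_id, label in labels:
        by_report[report_id].append(label)
    pairs = [v for v in by_report.values() if len(v) == 2]
    return sum(a == b for a, b in pairs) / len(pairs)
```

The first function depends only on the ordinal ranking, matching the cutoff-free claim in (b); the second inherits whatever classifier produced the labels, matching the consistency-check framing in (c).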

We emphasize that the 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the byte-level pixel-identity evidence, the unimodal-long-tail dip-test result, the three complementary analyses above, and the held-out Firm A fold (described in Section III-J; fold-level rate differences are disclosed in Section IV-F.2).
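The byte-level pair analysis in the first evidence strand reduces to a hash-grouping pass. A minimal sketch of the idea behind `28_byte_identity_decomposition.py`, with a hypothetical record layout (not the script's actual interface):

```python
import hashlib
from collections import defaultdict

def byte_identity_groups(records):
    """Group byte-identical same-CPA signature images that appear on at
    least two distinct reports.
    records : iterable of (cpa_id, report_id, image_bytes) tuples.
    Returns {(cpa_id, sha256_hexdigest): sorted report_id list}."""
    groups = defaultdict(set)
    for cpa_id, report_id, blob in records:
        groups[(cpa_id, hashlib.sha256(blob).hexdigest())].add(report_id)
    # Keep only groups spanning two or more distinct reports: these are
    # the byte-identical (hence pixel-identical) cross-report matches.
    return {k: sorted(v) for k, v in groups.items() if len(v) >= 2}
```

Because equal bytes imply equal pixels, this check is threshold-free: it needs no similarity cutoff at all, which is what makes it a clean positive anchor.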
@@ -280,7 +280,7 @@ High feature-level similarity without structural corroboration---consistent with

We note three conventions about the thresholds.
First, the cosine cutoff $0.95$ corresponds to approximately the whole-sample Firm A P7.5 of the per-signature best-match cosine distribution---that is, 92.5% of whole-sample Firm A signatures exceed this cutoff and 7.5% fall at or below it (Section III-H)---chosen as a round-number lower-tail boundary whose complement (92.5% above) has a transparent interpretation in the whole-sample reference distribution; the cosine crossover $0.837$ is the all-pairs intra/inter KDE crossover; both are derived from whole-sample distributions rather than from the 70% calibration fold, so the classifier inherits its operational cosine cuts from the whole-sample Firm A and all-pairs distributions.
Section IV-F.3 reports a sensitivity check confirming that replacing $0.95$ with the nearby rounded sensitivity cut $0.945$ (motivated by the calibration-fold P5 = 0.9407, see Section IV-F.2) shifts whole-Firm-A dual-rule capture by 1.19 percentage points, so the round-number heuristic is robust to nearby percentile-based alternatives.
Section IV-F.2 reports both calibration-fold and held-out-fold capture rates for this classifier so that fold-level sampling variance is visible.
Second, the dHash cutoffs $\leq 5$ and $> 15$ are chosen from the whole-sample Firm A $\text{dHash}_\text{indep}$ distribution: $\leq 5$ captures the upper tail of the high-similarity mode (whole-sample Firm A median $\text{dHash}_\text{indep} = 2$, P75 $\approx 4$, so $\leq 5$ is the band immediately above median), while $> 15$ marks the regime in which independent-minimum structural similarity is no longer indicative of image reproduction.
Third, the signature-level threshold-estimator outputs of Section IV-D (KDE antimode, Beta-mixture and logit-Gaussian crossings, BD/McCrary diagnostic) are *not* the operational thresholds of this classifier: they are descriptive characterisation of the per-signature similarity distribution, and Section IV-D shows they do not converge to a clean two-mechanism boundary at the per-signature level---which is why the operational cosine cut is anchored on the whole-sample Firm A percentile rather than on any mixture-fit crossing.
|
||||||
|
|||||||
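The percentile-anchored calibration described above can be sketched in a few lines; this is a toy illustration with synthetic similarity scores (the distribution parameters and sample size are invented, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for whole-sample Firm A best-match cosine scores.
firm_a_cos = np.clip(rng.normal(0.985, 0.02, 60_000), 0.0, 1.0)

# Anchor the operational cut on a low percentile of the calibration
# distribution (the paper uses P5), then round to a nearby round number.
p5 = np.percentile(firm_a_cos, 5)
operational_cut = round(p5, 2)

# Share of the calibration population captured by the rounded cut.
capture = (firm_a_cos > operational_cut).mean()
print(f"P5 = {p5:.4f}  cut = {operational_cut}  capture = {capture:.2%}")
```

The point of the sketch is only the order of operations: percentile first, round-number heuristic second, capture-rate check last.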
@@ -73,9 +73,9 @@ The $N = 168{,}740$ count used in Table V and in the downstream same-CPA per-sig

| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal |

-->
-Firm A's per-signature cosine distribution is *unimodal* ($p = 0.17$), reflecting a single dominant generative mechanism (non-hand-signing) with a long left tail attributable to within-firm heterogeneity in signing outputs (Section III-G discusses the scope of partner-level claims).
+Firm A's per-signature cosine distribution *fails to reject unimodality* ($p = 0.17$), a pattern consistent with a dominant high-similarity regime plus a long left tail attributable to within-firm heterogeneity in signing outputs (Section III-G discusses the scope of partner-level claims).

The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is *multimodal* ($p < 0.001$).

-The Firm A unimodal-long-tail finding is the structural evidence that supports the replication-dominated framing (Section III-H): a single dominant mechanism plus residual within-firm heterogeneity, rather than two cleanly separated mechanisms.
+The Firm A unimodal-long-tail finding is, in conjunction with the byte-identity, partner-ranking, and intra-report evidence reported below, consistent with the replication-dominated framing (Section III-H): a dominant high-similarity regime plus residual within-firm heterogeneity, rather than two cleanly separated mechanisms.
### 2) Burgstahler-Dichev / McCrary Density-Smoothness Diagnostic
@@ -204,7 +204,7 @@ Under this proper test the two extreme rules agree across folds (cosine $> 0.837

The operationally relevant rules in the 85–95% capture band differ between folds by 1–5 percentage points ($p < 0.001$ given the $n \approx 45\text{k}/15\text{k}$ fold sizes).

Both folds nevertheless sit in the same replication-dominated regime: every calibration-fold rate in the 85–99% range has a held-out counterpart in the 87–99% range, and the operational dual rule cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.40% of the calibration fold and 91.54% of the held-out fold.

The modest fold gap is consistent with within-Firm-A heterogeneity in replication intensity: the random 30% CPA sample evidently contained proportionally more high-replication CPAs.

-We therefore interpret the held-out fold as confirming the qualitative finding (Firm A is strongly replication-dominated across both folds) while cautioning that exact rates carry fold-level sampling noise that a single 30% split cannot eliminate; the threshold-independent partner-ranking analysis (Section IV-F.2) is the cross-check that is robust to this fold variance.
+We therefore interpret the held-out fold as confirming the qualitative finding (Firm A is strongly replication-dominated across both folds) while cautioning that exact rates carry fold-level sampling noise that a single 30% split cannot eliminate; the threshold-independent partner-ranking analysis (Section IV-G.2) is the cross-check that is robust to this fold variance.
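The fold-level capture computation above is just a conjunction of the two descriptor conditions, applied separately to each fold; a minimal sketch with synthetic columns (array names, distributions, and sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
# Synthetic per-signature descriptors for one fold.
cos = np.clip(rng.normal(0.98, 0.03, n), 0.0, 1.0)
dhash_indep = rng.integers(0, 20, n)

# Operational dual rule from the text: cosine > 0.95 AND dHash_indep <= 8.
captured = (cos > 0.95) & (dhash_indep <= 8)
print(f"dual-rule capture: {captured.mean():.2%}")
```

Running the same two-line rule on the calibration and held-out folds and comparing the two means is all the fold comparison amounts to.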
### 3) Operational-Threshold Sensitivity: cos $> 0.95$ vs cos $> 0.945$
@@ -332,7 +332,7 @@ A report is "in agreement" if both signature labels fall in the same coarse buck

Firm A achieves 89.9% intra-report agreement, with 87.5% of Firm A reports having *both* signers classified as non-hand-signed and only 4 reports (0.01%) having both classified as likely hand-signed.

The other Big-4 firms (B, C, D) and non-Big-4 firms cluster at 62-67% agreement, a 23-28 percentage-point gap.

-This sharp discontinuity in intra-report agreement between Firm A and the other firms is the pattern predicted by firm-wide (rather than partner-specific) non-hand-signing practice.
+This 23-28 percentage-point gap in intra-report agreement between Firm A and the other firms is consistent with firm-wide (rather than partner-specific) non-hand-signing practice; we do not claim a sharp discontinuity in the formal sense, since classifier calibration, firm-specific document-production pipelines, and signer-mix differences could each contribute to gap magnitude.

We note that this test uses the calibrated classifier of Section III-K rather than a threshold-free statistic; the substantive evidence lies in the *cross-firm gap* between Firm A and the other firms rather than in the absolute agreement rate at any single firm, and that gap is robust to moderate shifts in the absolute cutoff so long as the cutoff is applied uniformly across firms.
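The intra-report agreement statistic reduces to a per-report comparison of the two co-signer labels; a schematic with hypothetical label pairs (the tuples below are invented, not the paper's counts):

```python
# Hypothetical (signer_1_label, signer_2_label) pairs for one firm's reports.
reports = [
    ("non_hand_signed", "non_hand_signed"),
    ("non_hand_signed", "hand_signed"),
    ("non_hand_signed", "non_hand_signed"),
    ("hand_signed", "hand_signed"),
]

# Share of reports whose two signers fall in the same coarse bucket.
agreement = sum(a == b for a, b in reports) / len(reports)
# Share with BOTH signers classified as non-hand-signed.
both_non_hand = sum(p == ("non_hand_signed", "non_hand_signed")
                    for p in reports) / len(reports)
print(f"agreement = {agreement:.1%}, both non-hand-signed = {both_non_hand:.1%}")
```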
@@ -341,7 +341,7 @@ We note that this test uses the calibrated classifier of Section III-K rather th

Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents.

The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents carry only a single detected signature, for which no same-CPA pairwise comparison and therefore no best-match cosine / min dHash statistic is available; those documents are excluded from the classification reported here.

We emphasize that the document-level proportions below reflect the *worst-case aggregation rule* of Section III-K: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts.

-Document-level rates therefore represent the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-F.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.
+Document-level rates therefore represent the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-G.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.

<!-- TABLE XVII: Document-Level Classification (Dual-Descriptor: Cosine + dHash)

| Verdict | N (PDFs) | % | Firm A | Firm A % |
@@ -370,8 +370,9 @@ We note that because the non-hand-signed thresholds are themselves calibrated to

### 2) Cross-Firm Comparison of Dual-Descriptor Convergence

-Among non-Firm-A CPAs with cosine $> 0.95$, only 11.3% exhibit dHash $\leq 5$, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer.
+Among the 65,515 non-Firm-A signatures with per-signature best-match cosine $> 0.95$, 42.12% have $\text{dHash}_\text{indep} \leq 5$, compared to 88.32% of the 55,921 Firm A signatures meeting the same cosine condition---a $\sim 2.1\times$ difference that the structural-verification layer makes visible.

-This cross-firm gap is consistent with firm-wide non-hand-signing practice at Firm A versus partner-specific or per-engagement replication at other firms; it complements the partner-level ranking (Section IV-F.2) and intra-report consistency (Section IV-F.3) findings.
+This cross-firm gap is consistent with firm-wide non-hand-signing practice at Firm A versus partner-specific or per-engagement replication at other firms; it complements the partner-level ranking (Section IV-G.2) and intra-report consistency (Section IV-G.3) findings.

+Counts and percentages are reproduced by `signature_analysis/28_byte_identity_decomposition.py` and reported in `reports/byte_identity_decomp/byte_identity_decomposition.json` (see Appendix B for the table-to-script provenance map).
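The cross-firm convergence share that Script 28 computes in SQL can be mirrored in plain Python; the rows below are placeholder tuples (firm group, best-match cosine, independent min dHash), not real data:

```python
from collections import defaultdict

# Placeholder (firm_group, best_match_cosine, min_dhash_independent) rows.
rows = [
    ("Firm A", 0.99, 2), ("Firm A", 0.97, 4), ("Firm A", 0.96, 12),
    ("Non-Firm-A", 0.98, 3), ("Non-Firm-A", 0.96, 9), ("Non-Firm-A", 0.97, 17),
]

n_above = defaultdict(int)      # signatures passing the cosine condition
n_converged = defaultdict(int)  # of those, signatures with dHash_indep <= 5
for grp, cos, dh in rows:
    if cos > 0.95:
        n_above[grp] += 1
        n_converged[grp] += (dh <= 5)

for grp in sorted(n_above):
    share = 100.0 * n_converged[grp] / n_above[grp]
    print(f"{grp}: {share:.2f}% with dHash_indep <= 5")
```

This is the same conditional share as the `GROUP BY firm_group` query, just computed row by row.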
## I. Ablation Study: Feature Backbone Comparison
@@ -8,39 +8,40 @@ occurring reference populations instead of manual labels:

 Positive anchor 1: pixel_identical_to_closest = 1
 Two signature images byte-identical after crop/resize.
 Mathematically impossible to arise from independent hand-signing
-=> absolute ground truth for replication.
+=> pair-level proof of image reuse and a CONSERVATIVE-SUBSET
+ground truth for non-hand-signing (only those whose nearest
+same-CPA match happens to be byte-identical).

-Positive anchor 2: Firm A (Deloitte) signatures
-Interview evidence from multiple Firm A accountants confirms that
-MOST use replication (stamping / firm-level e-signing) but a
-MINORITY may still hand-sign. Firm A is therefore a
-"replication-dominated" population (not a pure one). We use it as
-a strong prior positive for the majority regime, while noting that
-~7% of Firm A signatures fall below cosine 0.95 consistent with
-the minority hand-signers. This matches the long left tail
-observed in the dip test (Script 15) and the Firm A members who
-land in C2 (middle band) of the accountant-level GMM (Script 18).
+Positive anchor 2: Firm A signatures
+Treated in the manuscript as a REPLICATION-DOMINATED population
+based on the paper's own image evidence: the byte-level pair
+analysis, the Firm A per-signature similarity distribution, the
+partner-ranking concentration, and the intra-report consistency
+gap. Approximately 7% of Firm A signatures fall below cosine
+0.95, forming the long left tail observed in the dip test
+(Script 15).

 Negative anchor: signatures with cosine <= low threshold
 Pairs with very low cosine similarity cannot plausibly be pixel
-duplicates, so they serve as absolute negatives.
+duplicates, so they serve as a conservative supplementary
+negative reference.

-Metrics reported:
-- FAR/FRR/EER using the pixel-identity anchor as the gold positive
-and low-similarity pairs as the gold negative.
-- Precision/Recall/F1 at cosine and dHash thresholds from Scripts
-15/16/17/18.
+Metrics computed (legacy; NOT all reported in the manuscript):
+- FAR against the inter-CPA negative anchor is the primary metric
+reported (Table X). The byte-identical positive anchor has cosine
+~= 1 by construction, so FRR / EER / Precision / F1 against that
+subset are arithmetic tautologies (FRR is trivially 0 below
+threshold 1) and are intentionally OMITTED from Table X. Legacy
+EER/FRR/precision/F1 helper functions remain in this script for
+diagnostic use only and their outputs are NOT cited as biometric
+performance in the paper.
 - Convergence with Firm A anchor (what fraction of Firm A signatures
 are correctly classified at each threshold).

-Small visual sanity sample (30 pairs) is exported for spot-check, but
-metrics are derived entirely from pixel and Firm A evidence.

 Output:
 reports/pixel_validation/pixel_validation_report.md
 reports/pixel_validation/pixel_validation_results.json
 reports/pixel_validation/roc_cosine.png, roc_dhash.png
-reports/pixel_validation/sanity_sample.csv
 """

 import sqlite3
@@ -2,26 +2,39 @@

 """
 Script 21: Expanded Validation with Larger Negative Anchor + Held-out Firm A
 ============================================================================
-Addresses codex review weaknesses of Script 19's pixel-identity validation:
+Addresses three weaknesses of Script 19's pixel-identity validation:

 (a) Negative anchor of n=35 (cosine<0.70) is too small to give
 meaningful FAR confidence intervals.
-(b) Pixel-identical positive anchor is an easy subset, not
-representative of the broader positive class.
-(c) Firm A is both the calibration anchor and the validation anchor
-(circular).
+(b) Pixel-identical positive anchor is a CONSERVATIVE SUBSET of the
+true non-hand-signed class, not representative of the broader
+positive class. Recall against this subset is therefore a
+lower-bound calibration check, not a generalizable recall
+estimate.
+(c) Firm A is both the calibration anchor and a validation anchor
+(circular). The 70/30 fold split makes within-Firm-A sampling
+variance visible without claiming external validation.

 This script:
 1. Constructs a large inter-CPA negative anchor (~50,000 pairs) by
 randomly sampling pairs from different CPAs. Inter-CPA high
 similarity is highly unlikely to arise from legitimate signing.
 2. Splits Firm A CPAs 70/30 into CALIBRATION and HELDOUT folds.
-Re-derives signature-level / accountant-level thresholds from the
-calibration fold only, then reports all metrics (including Firm A
-anchor rates) on the heldout fold.
-3. Computes proper EER (FAR = FRR interpolated) in addition to
-metrics at canonical thresholds.
-4. Computes 95% Wilson confidence intervals for each FAR/FRR.
+Re-derives signature-level thresholds from the calibration fold
+only, then reports capture rates on the heldout fold.
+3. Computes 95% Wilson confidence intervals for FAR at canonical
+thresholds (Table X in the manuscript).
+
+Legacy / diagnostic-only metrics:
+Helper functions for EER, Precision, Recall, F1, and FRR remain in
+this script for backward compatibility. The manuscript intentionally
+OMITS these metrics from Table X because the byte-identical positive
+anchor has cosine ~= 1 by construction (so FRR / EER are arithmetic
+tautologies) and because positive and negative anchors are
+constructed from different sampling units, making prevalence
+arbitrary (so Precision and F1 have no meaningful population
+interpretation). Only FAR against the large inter-CPA negative
+anchor is reported as a biometric metric in the paper.

 Output:
 reports/expanded_validation/expanded_validation_report.md
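The 95% Wilson interval named in Script 21's docstring has a standard closed form; a self-contained sketch follows (the false-accept count of 12 and the pair count of 50,000 are illustrative stand-ins, not the script's output):

```python
import math

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion k/n."""
    if n == 0:
        return (0.0, 0.0)
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - half, centre + half)

# Illustrative: 12 false accepts among ~50,000 inter-CPA negative pairs.
lo, hi = wilson_ci(12, 50_000)
print(f"FAR point = {12 / 50_000:.5f}, 95% Wilson CI = [{lo:.5f}, {hi:.5f}]")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly at the very small proportions a FAR estimate involves, which is presumably why the script uses it.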
@@ -0,0 +1,204 @@
#!/usr/bin/env python3
"""
Script 28: Byte-Identity Decomposition + Cross-Firm Dual-Descriptor Convergence
================================================================================
Produces two reproducible artifacts cited in the manuscript that previously
lacked dedicated provenance (codex review v3.18.1 items #7 and #8):

(#7) Byte-identical Firm A signature decomposition:
     - Total Firm A signatures with pixel_identical_to_closest = 1
     - Number of distinct Firm A partners they span
     - Number of partners in the registry (denominator)
     - Number of byte-identical pairs that span DIFFERENT fiscal years

(#8) Cross-firm dual-descriptor convergence:
     - Among signatures with cosine > 0.95 (per-signature best-match),
       the fraction with min_dhash_independent <= 5, broken out by
       Firm A vs Non-Firm-A.

Output:
    /Volumes/NV2/PDF-Processing/signature-analysis/reports/byte_identity_decomp/
        byte_identity_decomposition.json
        byte_identity_decomposition.md

These figures are intended to be cited from the paper (Section IV-F.1 for #7;
Section IV-H.2 for #8) so that every quantitative claim in the manuscript
traces to a specific JSON field.
"""

import json
import sqlite3
from datetime import datetime
from pathlib import Path

DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
           'byte_identity_decomp')
OUT.mkdir(parents=True, exist_ok=True)

FIRM_A = '勤業眾信聯合'


def byte_identity_decomposition(conn):
    """Codex item #7: 145 / 50 / 180 / 35 decomposition."""
    cur = conn.cursor()

    cur.execute("""
        SELECT COUNT(DISTINCT name)
        FROM accountants
        WHERE firm = ?
    """, (FIRM_A,))
    n_registered_partners = cur.fetchone()[0]

    cur.execute("""
        WITH byte_pairs AS (
            SELECT s1.signature_id AS sig_a,
                   s1.assigned_accountant AS partner,
                   s1.year_month AS ym_a,
                   s2.year_month AS ym_b
            FROM signatures s1
            JOIN signatures s2 ON s1.closest_match_file = s2.image_filename
            WHERE s1.pixel_identical_to_closest = 1
              AND s1.excel_firm = ?
        )
        SELECT
            COUNT(*) AS total_pixel_identical_firm_a,
            COUNT(DISTINCT partner) AS partners_with_pixel_identical,
            SUM(CASE WHEN substr(ym_a,1,4) <> substr(ym_b,1,4) THEN 1 ELSE 0 END)
                AS cross_year_pairs
        FROM byte_pairs
    """, (FIRM_A,))
    n_total, n_partners, n_cross_year = cur.fetchone()

    return {
        'definition': (
            'Among Firm A signatures whose nearest same-CPA match is '
            'byte-identical after crop and normalization '
            '(pixel_identical_to_closest = 1), this section reports the '
            'count, the distinct-partner spread, the registry denominator, '
            'and the subset whose byte-identical match is in a different '
            'fiscal year.'
        ),
        'firm_label': 'Firm A',
        'n_pixel_identical_firm_a_signatures': n_total,
        'n_distinct_partners_with_pixel_identical': n_partners,
        'n_registered_partners_in_firm_a': n_registered_partners,
        'partner_coverage_share': round(n_partners / n_registered_partners, 4),
        'n_cross_year_byte_identical_pairs': n_cross_year,
    }


def cross_firm_dual_convergence(conn):
    """Codex item #8: per-signature dual-descriptor convergence by firm."""
    cur = conn.cursor()

    cur.execute("""
        SELECT
            CASE WHEN excel_firm = ? THEN 'Firm A' ELSE 'Non-Firm-A' END
                AS firm_group,
            COUNT(*) AS n_signatures_above_095,
            SUM(CASE WHEN min_dhash_independent <= 5 THEN 1 ELSE 0 END)
                AS n_dhash_le_5
        FROM signatures
        WHERE max_similarity_to_same_accountant > 0.95
          AND assigned_accountant IS NOT NULL
          AND min_dhash_independent IS NOT NULL
        GROUP BY firm_group
        ORDER BY firm_group
    """, (FIRM_A,))

    rows = cur.fetchall()
    by_group = {}
    for firm_group, n_above, n_dhash in rows:
        by_group[firm_group] = {
            'n_signatures_above_cosine_095': n_above,
            'n_dhash_indep_le_5': n_dhash,
            'pct_dhash_indep_le_5': round(100.0 * n_dhash / n_above, 2),
        }

    return {
        'definition': (
            'Per-signature best-match cosine > 0.95 AND assigned_accountant '
            'IS NOT NULL AND min_dhash_independent IS NOT NULL. The reported '
            'percentage is the share of these signatures whose independent '
            'min dHash to any same-CPA signature is <= 5.'
        ),
        'unit_of_observation': 'signature',
        'cosine_threshold': 0.95,
        'dhash_indep_threshold': 5,
        'by_firm_group': by_group,
    }


def write_markdown(payload, path):
    bid = payload['byte_identity_decomposition']
    cf = payload['cross_firm_dual_convergence']

    lines = []
    lines.append('# Byte-Identity Decomposition + Cross-Firm Dual-Descriptor '
                 'Convergence')
    lines.append('')
    lines.append(f"Generated at: {payload['generated_at']}")
    lines.append('')

    lines.append('## 1. Byte-Identity Decomposition (Firm A)')
    lines.append('')
    lines.append(bid['definition'])
    lines.append('')
    lines.append('| Quantity | Value |')
    lines.append('|----------|-------|')
    lines.append(f"| Pixel-identical Firm A signatures | "
                 f"{bid['n_pixel_identical_firm_a_signatures']} |")
    lines.append(f"| Distinct Firm A partners with at least one such pair | "
                 f"{bid['n_distinct_partners_with_pixel_identical']} |")
    lines.append(f"| Registered Firm A partners | "
                 f"{bid['n_registered_partners_in_firm_a']} |")
    lines.append(f"| Partner coverage share | "
                 f"{bid['partner_coverage_share']:.3f} |")
    lines.append(f"| Pairs whose byte-identical match spans different fiscal "
                 f"years | {bid['n_cross_year_byte_identical_pairs']} |")
    lines.append('')

    lines.append('## 2. Cross-Firm Dual-Descriptor Convergence')
    lines.append('')
    lines.append(cf['definition'])
    lines.append('')
    lines.append('| Firm group | N signatures with cosine > 0.95 | '
                 'N with dHash_indep <= 5 | % with dHash_indep <= 5 |')
    lines.append('|------------|--------------------------------:|'
                 '------------------------:|------------------------:|')
    for grp in ('Firm A', 'Non-Firm-A'):
        g = cf['by_firm_group'][grp]
        lines.append(f"| {grp} | "
                     f"{g['n_signatures_above_cosine_095']:,} | "
                     f"{g['n_dhash_indep_le_5']:,} | "
                     f"{g['pct_dhash_indep_le_5']:.2f}% |")

    path.write_text('\n'.join(lines) + '\n', encoding='utf-8')


def main():
    conn = sqlite3.connect(DB)
    try:
        payload = {
            'generated_at': datetime.now().isoformat(timespec='seconds'),
            'database_path': DB,
            'firm_a_label': FIRM_A,
            'byte_identity_decomposition': byte_identity_decomposition(conn),
            'cross_firm_dual_convergence': cross_firm_dual_convergence(conn),
        }
    finally:
        conn.close()

    json_path = OUT / 'byte_identity_decomposition.json'
    json_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False),
                         encoding='utf-8')
    print(f'Wrote {json_path}')

    md_path = OUT / 'byte_identity_decomposition.md'
    write_markdown(payload, md_path)
    print(f'Wrote {md_path}')


if __name__ == '__main__':
    main()
Block a user