Paper A v3.18.2: address codex GPT-5.5 round-16 Minor-Revision findings

Codex independent peer review (paper/codex_review_gpt55_v3_18_1.md) audited
empirical claims against scripts/JSON reports rather than rubber-stamping
prior Accept verdicts. Verdict: Minor Revision. This commit addresses every
flagged item.

- Soften mechanism-identification language (Results IV-D.1, Discussion B):
  per-signature cosine "fails to reject unimodality" rather than "reflects a
  single dominant generative mechanism"; framing tied to joint evidence.
- Replace the overly absolute "single stored image" phrasing with
  multi-template wording in the Introduction and Methodology III-A.
- Reframe Methodology III-H so practitioner knowledge is non-load-bearing;
  evidentiary basis is the paper's own image evidence.
- Fix stale section cross-references after the v3.18 retitling: IV-F.* ->
  IV-G.* in 11 locations across methodology and results.
- Fix 0.941 / 0.945 / 0.9407 wording in Methodology III-K to use the
  calibration-fold P5 = 0.9407 and the rounded sensitivity cut 0.945.
- Soften "sharp discontinuity" in Results IV-G.3 to "23-28 percentage-point
  gap consistent with firm-wide non-hand-signing practice".
- Soften Conclusion's "directly generalizable" with explicit conditions on
  analogous anchors and artifact-generation physics.
- Add Appendix B: table-to-script provenance map (15 manuscript tables
  mapped to generating scripts and JSON report artifacts).
- New script signature_analysis/28_byte_identity_decomposition.py produces
  reproducible artifacts for two previously-unverified claims:
  (a) 145 / 50 / 180 / 35 Firm A byte-identity decomposition (verified);
  (b) cross-firm dual-descriptor convergence -- corrected from the previous
      manuscript text "non-Firm-A 11.3% vs Firm A 58.7% (5x)" to the
      database-verified "non-Firm-A 42.12% vs Firm A 88.32% (~2.1x)".
- Clarify scripts 19 / 21 docstrings: legacy EER / FRR / Precision / F1
  helpers are retained for diagnostic use only and are NOT cited as
  biometric performance in the paper. Remove "interview evidence" wording.
- Rebuild Paper_A_IEEE_Access_Draft_v3.docx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 20:23:08 +08:00
parent cb77f481ec
commit 4bb7aa9189
9 changed files with 299 additions and 53 deletions
@@ -33,3 +33,30 @@ Taken together, Table A.I shows that the signature-level BD/McCrary transitions
This observation supports the main-text decision to use BD/McCrary as a density-smoothness diagnostic rather than as a threshold estimator and reinforces the joint reading of Section IV-D that per-signature similarity does not form a clean two-mechanism mixture.
Raw per-bin $Z$ sequences and $p$-values for every (variant, bin-width) panel are available in the supplementary materials.
# Appendix B. Table-to-Script Provenance
For reproducibility, the following table maps each numerical table in Section IV to the analysis script that produces its underlying values and to the JSON / Markdown report file emitted by that script. Scripts referenced are under `signature_analysis/` and reports under the project's `reports/` tree.
<!-- TABLE B.I: Manuscript table → reproduction artifact
| Manuscript table | Generating script | Report artifact |
|------------------|-------------------|-----------------|
| Table III (extraction results) | `02_extract_features.py`; `09_pdf_signature_verdict.py` | extraction logs (supplementary) |
| Table IV (intra/inter all-pairs cosine statistics) | `10_formal_statistical_analysis.py` | `reports/formal_statistical/formal_statistical_results.json` |
| Table V (Hartigan dip test) | `15_hartigan_dip_test.py` | `reports/dip_test/dip_test_results.json` |
| Table VI (signature-level threshold-estimator summary) | `17_beta_mixture_em.py`; `25_bd_mccrary_sensitivity.py` | `reports/beta_mixture/beta_mixture_results.json`; `reports/bd_sensitivity/bd_sensitivity.json` |
| Table IX (Firm A whole-sample capture rates) | `19_pixel_identity_validation.py`; `24_validation_recalibration.py` | `reports/pixel_validation/pixel_validation_results.json`; `reports/validation_recalibration/validation_recalibration.json` |
| Table X (cosine threshold sweep, FAR vs inter-CPA negatives) | `21_expanded_validation.py` | `reports/expanded_validation/expanded_validation_results.json` |
| Table XI (held-out vs calibration Firm A capture rates) | `24_validation_recalibration.py` | `reports/validation_recalibration/validation_recalibration.json` |
| Table XII (operational-cut sensitivity 0.95 vs 0.945) | `24_validation_recalibration.py` | `reports/validation_recalibration/validation_recalibration.json` |
| Table XIII (Firm A per-year cosine distribution) | `13_deloitte_distribution_analysis.py` | `reports/deloitte_distribution/deloitte_distribution_results.json` |
| Tables XIV / XV (partner-level similarity ranking) | `22_partner_ranking.py` | `reports/partner_ranking/partner_ranking_results.json` |
| Table XVI (intra-report classification agreement) | `23_intra_report_consistency.py` | `reports/intra_report/intra_report_results.json` |
| Table XVII (document-level five-way classification) | `09_pdf_signature_verdict.py`; `12_generate_pdf_level_report.py` | `reports/pdf_level/pdf_level_results.json` |
| Table XVIII (backbone ablation) | `paper/ablation_backbone_comparison.py` | `reports/ablation/ablation_results.json` |
| Table A.I (BD/McCrary bin-width sensitivity) | `25_bd_mccrary_sensitivity.py` | `reports/bd_sensitivity/bd_sensitivity.json` |
| Byte-identity decomposition (145 / 50 / 180 / 35; Section IV-F.1) | `28_byte_identity_decomposition.py` | `reports/byte_identity_decomp/byte_identity_decomposition.json` |
| Cross-firm dual-descriptor convergence (Section IV-H.2) | `28_byte_identity_decomposition.py` | `reports/byte_identity_decomp/byte_identity_decomposition.json` |
-->
The table-to-script mapping above is intended as a navigation aid for replicators. All scripts run deterministically under the fixed random seeds documented in the supplementary materials; report files are committed alongside the scripts so that each numerical claim in Section IV traces to a specific JSON field rather than to an undocumented intermediate computation.
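The provenance map can also be consumed programmatically. The sketch below transcribes a few rows of Table B.I into a lookup helper; the dictionary and function names are illustrative conveniences for replicators, not part of the repository's actual API.

```python
# Minimal sketch: Table B.I rows as a machine-readable provenance map.
# Keys and helper are hypothetical; script/report paths come from Table B.I.
PROVENANCE = {
    "Table V": ("15_hartigan_dip_test.py",
                "reports/dip_test/dip_test_results.json"),
    "Table XI": ("24_validation_recalibration.py",
                 "reports/validation_recalibration/validation_recalibration.json"),
    "Table A.I": ("25_bd_mccrary_sensitivity.py",
                  "reports/bd_sensitivity/bd_sensitivity.json"),
}

def artifacts_for(table: str) -> tuple[str, str]:
    """Return (generating script, report artifact) for a manuscript table."""
    return PROVENANCE[table]
```

A replicator could extend the dictionary to all fifteen rows and assert at build time that every referenced report file exists in the committed `reports/` tree.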
@@ -27,5 +27,5 @@ Several directions merit further investigation.
Domain-adapted feature extractors, trained or fine-tuned on signature-specific datasets, may improve discriminative performance beyond the transferred ImageNet features used in this study.
Extending the analysis to auditor-year units---computing per-signature statistics within each fiscal year and tracking how individual CPAs move across years---could reveal within-CPA transitions between hand-signing and non-hand-signing over the decade and is the natural next step beyond the cross-sectional analysis reported here.
The pipeline's applicability to other jurisdictions and document types (e.g., corporate filings in other countries, legal documents, medical records) warrants exploration.
The replication-dominated calibration strategy and the pixel-identity anchor technique are both generalizable to settings in which (i) a reference subpopulation has a known dominant mechanism and (ii) the target mechanism leaves a byte-level signature in the artifact itself, conditional on the availability of analogous anchors in the new domain and on artifact-generation physics that preserve the byte-level trace.
Finally, integration with regulatory monitoring systems and a larger negative-anchor study---for example drawing from inter-CPA pairs under explicit accountant-level blocking---would strengthen the practical deployment potential of this approach.
@@ -9,7 +9,7 @@ While the law permits either a handwritten signature or a seal, the CPA's attest
The digitization of financial reporting has introduced a practice that complicates this intent.
As audit reports are now routinely generated, transmitted, and archived as PDF documents, it is technically and operationally straightforward to reproduce a CPA's stored signature image across many reports rather than re-executing the signing act for each one.
This reproduction can occur either through an administrative stamping workflow---in which scanned signature images are affixed by staff as part of the report-assembly process---or through a firm-level electronic signing system that automates the same step.
From the perspective of the output image the two workflows are equivalent: both can reproduce one or more stored signature images, producing same-CPA signatures that are identical or near-identical up to reproduction, scanning, compression, and template-variant noise.
We refer to signatures produced by either workflow collectively as *non-hand-signed*.
Although this practice may fall within the literal statutory requirement of "signature or seal," it raises substantive concerns about audit quality, as an identically reproduced signature applied across hundreds of reports may not represent meaningful individual attestation for each engagement.
The accounting literature has long examined the audit-quality consequences of partner-level engagement transparency: studies of partner-signature mandates in the United Kingdom find measurable downstream effects [31], cross-jurisdictional evidence on individual partner signature requirements highlights similar quality channels [32], and Taiwan-specific evidence on mandatory partner rotation documents how individual-partner identification interacts with audit-quality outcomes [33].
@@ -7,7 +7,7 @@ Fig. 1 illustrates the overall architecture.
The pipeline takes as input a corpus of PDF audit reports and produces, for each document, a classification of its CPA signatures along a confidence continuum anchored on whole-sample Firm A percentile heuristics and validated against a byte-level pixel-identity positive anchor and a large random inter-CPA negative anchor.
Throughout this paper we use the term *non-hand-signed* rather than "digitally replicated" to denote any signature produced by reproducing a previously stored image of the partner's signature---whether by administrative stamping workflows (dominant in the early years of the sample) or firm-level electronic signing systems (dominant in the later years).
From the perspective of the output image the two workflows are equivalent: both can reproduce one or more stored signature images, producing same-CPA signatures that are identical or near-identical up to reproduction, scanning, compression, and template-variant noise.
<!--
[Figure 1: Pipeline Architecture - clean vector diagram]
@@ -116,7 +116,7 @@ Cosine similarity and dHash are both robust to the noise introduced by the print
Two unit-of-analysis choices are relevant for this study, ordered from finest to coarsest: (i) the *signature*---one signature image extracted from one report; and (ii) the *auditor-year*---all signatures by one CPA within one fiscal year.
The signature is the operational unit of classification (Section III-K) and of all primary statistical analyses (Section IV-D, IV-F, IV-G).
The auditor-year is used in the partner-level similarity ranking of Section IV-G.2 as a deliberately within-year aggregation that avoids cross-year pooling.
We do not use a coarser CPA-level cross-year unit, because pooling a CPA's signatures across the full 2013--2023 sample period would conflate distinct signing-mechanism regimes whenever a CPA's practice changes during the sample, and we make no claim about the within-CPA stability of signing mechanisms over time.
For per-signature classification we compute, for each signature, the maximum pairwise cosine similarity and the minimum dHash Hamming distance against every other signature attributed to the same CPA (over the full same-CPA set, not restricted to the same fiscal year).
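The per-signature statistics just described can be sketched as follows. This is an illustrative implementation under stated assumptions (unit-normalizable embedding vectors, 64-bit integer dHashes); the function name and signature are hypothetical, not the repository's actual API.

```python
# Sketch: per-signature best-match cosine and minimum dHash Hamming distance
# against every OTHER signature attributed to the same CPA.
import numpy as np

def best_match_stats(embeddings: np.ndarray, dhashes: list[int]):
    """embeddings: (n, d) feature vectors for one CPA's n signatures.
    dhashes: n perceptual hashes as integers.
    Returns (max_cosine, min_hamming) arrays, one entry per signature."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos = X @ X.T                    # all-pairs cosine similarity
    np.fill_diagonal(cos, -np.inf)   # exclude the self-match
    max_cos = cos.max(axis=1)
    ham = np.array([[bin(a ^ b).count("1") for b in dhashes] for a in dhashes])
    np.fill_diagonal(ham, 2**31)     # exclude the self-match
    min_ham = ham.min(axis=1)
    return max_cos, min_ham
```

A high `max_cos` paired with a low `min_ham` is the dual-descriptor pattern the classifier of Section III-K keys on.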
@@ -136,29 +136,29 @@ We make *no* within-year or across-year uniformity assumption about CPA signing
Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation.
A CPA's signing output within a single fiscal year may reflect a single replication template, multiple templates used in parallel (e.g., different stored images for different engagement positions or reporting pipelines), within-year mechanism mixing, or a combination; our signature-level analyses remain valid under all of these regimes, since they do not attempt mechanism attribution at the partner or report level.
The intra-report consistency analysis in Section IV-G.3 is a firm-level homogeneity check---whether the *two co-signing CPAs on the same report* receive the same signature-level label under the operational classifier---rather than a test of within-partner or within-year uniformity.
## H. Calibration Reference: Firm A as a Replication-Dominated Population
A distinctive aspect of our methodology is the use of Firm A---a major Big-4 accounting firm in Taiwan---as an empirical calibration reference.
Rather than treating Firm A as a synthetic or laboratory positive control, we treat it as a naturally occurring *replication-dominated population*: a CPA population whose aggregate signing behavior is dominated by non-hand-signing but is not a pure positive class.
Practitioner knowledge motivated treating Firm A as a candidate calibration reference: it is widely held within the audit profession that the firm reproduces a stored signature image for the majority of certifying partners---originally via administrative stamping workflows and later via firm-level electronic signing systems---while not ruling out that a minority of partners may continue to hand-sign some or all of their reports.
This practitioner background is *non-load-bearing* in our analysis: the evidentiary basis used in this paper is the observable image evidence reported below---byte-identical same-CPA pairs, the Firm A per-signature similarity distribution, partner-ranking concentration, and intra-report consistency---which does not depend on any claim about signing practice beyond what the audit-report images themselves show.
We establish Firm A's replication-dominated status through two primary independent quantitative analyses plus a third strand comprising three complementary checks, each of which can be reproduced from the public audit-report corpus alone:
First, *automated byte-level pair analysis* (Section IV-F.1; reproduced by `signature_analysis/28_byte_identity_decomposition.py` with output in `reports/byte_identity_decomp/byte_identity_decomposition.json`) identifies 145 Firm A signatures that are byte-identical to at least one other same-CPA signature from a different audit report, distributed across 50 distinct Firm A partners (of 180 registered); 35 of these byte-identical matches span different fiscal years.
Byte-identity implies pixel-identity by construction, and independent hand-signing cannot produce pixel-identical images across distinct reports---these pairs therefore establish image reuse as a concrete, threshold-free phenomenon within Firm A and confirm that replication is widespread (50 of 180 registered partners) rather than confined to a handful of CPAs.
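The byte-identity check is threshold-free because it reduces to exact equality of file contents. A minimal sketch, assuming signatures are available as raw image bytes keyed by (CPA, report); the function and key layout are illustrative, and a strict implementation would compare raw bytes on digest collision rather than trusting the hash alone.

```python
# Sketch: find same-CPA, cross-report signature images with identical bytes.
# Identical SHA-256 digests over raw bytes imply byte-identity (hence
# pixel-identity) up to negligible collision probability.
import hashlib
from collections import defaultdict

def byte_identical_pairs(files: dict[tuple[str, str], bytes]):
    """files maps (cpa_id, report_id) -> raw signature-image bytes.
    Returns (cpa_id, report_i, report_j) triples with identical bytes."""
    buckets = defaultdict(list)
    for (cpa, report), blob in files.items():
        buckets[(cpa, hashlib.sha256(blob).hexdigest())].append(report)
    pairs = []
    for (cpa, _), reports in buckets.items():
        for i in range(len(reports)):
            for j in range(i + 1, len(reports)):
                pairs.append((cpa, reports[i], reports[j]))
    return pairs
```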
Second, *signature-level distributional evidence*: Firm A's per-signature best-match cosine distribution is unimodal with a long left tail (Hartigan dip test $p = 0.17$ at $n \geq 10$ signatures; Section IV-D), consistent with a single dominant mechanism (non-hand-signing) plus residual within-firm heterogeneity rather than two cleanly separated mechanisms.
92.5% of Firm A's per-signature best-match cosine similarities exceed 0.95 and the remaining 7.5% form the long left tail (we do not disaggregate partner-level mechanism here; see Section III-G for the scope of claims).
The unimodal-long-tail shape, not the precise 92.5/7.5 split, is the structural evidence: it predicts that Firm A is replication-dominated rather than a clean two-class population, and a noise-only explanation of the left tail would predict a shrinking share as scan/PDF technology matured over 2013--2023, which is not what we observe (Section IV-G.1).
Third, we additionally validate the Firm A benchmark through three complementary analyses reported in Section IV-G. Only the partner-level ranking is fully threshold-free; the longitudinal-stability and intra-report analyses use the operational classifier and are interpreted as consistency checks on its firm-level output:
(a) *Longitudinal stability (Section IV-G.1).* The share of Firm A per-signature best-match cosine values below 0.95 is stable at 6-13% across 2013-2023, with the lowest share in 2023. The 0.95 cutoff is the whole-sample Firm A P7.5 heuristic (Section III-K; 92.5% of whole-sample Firm A signatures exceed this cutoff); the substantive finding here is the *temporal stability* of the rate, not the absolute rate at any single year.
(b) *Partner-level similarity ranking (Section IV-G.2).* When every auditor-year is ranked globally by its per-auditor-year mean best-match cosine (across all firms: Big-4 and Non-Big-4), Firm A auditor-years account for 95.9% of the top decile against a baseline share of 27.8% (a 3.5$\times$ concentration ratio), and this over-representation is stable across 2013-2023. This analysis uses only the ordinal ranking and is independent of any absolute cutoff.
(c) *Intra-report consistency (Section IV-G.3).* Because each Taiwanese statutory audit report is co-signed by two engagement partners, firm-wide stamping practice predicts that both signers on a given Firm A report should receive the same signature-level label under the classifier. Firm A exhibits 89.9% intra-report agreement against 62-67% at the other Big-4 firms. This test uses the operational classifier and is therefore a *consistency* check on the classifier's firm-level output rather than a threshold-free test; the cross-firm gap (not the absolute rate) is the substantive finding.
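The top-decile concentration statistic of check (b) is a simple ordinal computation; the sketch below illustrates it on synthetic rows. The function name is hypothetical, and the numbers in the test are toy values, not the paper's 95.9%/27.8% figures.

```python
# Sketch: firm's share of the top decile of auditor-years (ranked by mean
# best-match cosine) versus its baseline share, and the concentration ratio.
def top_decile_concentration(rows: list[tuple[str, float]], firm: str):
    """rows: (firm_name, mean_best_match_cosine) per auditor-year."""
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    k = max(1, len(ranked) // 10)                      # top decile
    top_share = sum(f == firm for f, _ in ranked[:k]) / k
    base_share = sum(f == firm for f, _ in ranked) / len(ranked)
    return top_share, base_share, top_share / base_share
```

Because only the ordinal ranking enters, the statistic is unaffected by any monotone rescaling of the cosine values, which is what makes check (b) threshold-free.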
We emphasize that the 92.5% figure is a within-sample consistency check rather than an independent validation of Firm A's status; the validation role is played by the byte-level pixel-identity evidence, the unimodal-long-tail dip-test result, the three complementary analyses above, and the held-out Firm A fold (described in Section III-J; fold-level rate differences are disclosed in Section IV-F.2).
@@ -280,7 +280,7 @@ High feature-level similarity without structural corroboration---consistent with
We note three conventions about the thresholds. We note three conventions about the thresholds.
First, the cosine cutoff $0.95$ corresponds to approximately the whole-sample Firm A P7.5 of the per-signature best-match cosine distribution---that is, 92.5% of whole-sample Firm A signatures exceed this cutoff and 7.5% fall at or below it (Section III-H)---chosen as a round-number lower-tail boundary whose complement (92.5% above) has a transparent interpretation in the whole-sample reference distribution; the cosine crossover $0.837$ is the all-pairs intra/inter KDE crossover; both are derived from whole-sample distributions rather than from the 70% calibration fold, so the classifier inherits its operational cosine cuts from the whole-sample Firm A and all-pairs distributions. First, the cosine cutoff $0.95$ corresponds to approximately the whole-sample Firm A P7.5 of the per-signature best-match cosine distribution---that is, 92.5% of whole-sample Firm A signatures exceed this cutoff and 7.5% fall at or below it (Section III-H)---chosen as a round-number lower-tail boundary whose complement (92.5% above) has a transparent interpretation in the whole-sample reference distribution; the cosine crossover $0.837$ is the all-pairs intra/inter KDE crossover; both are derived from whole-sample distributions rather than from the 70% calibration fold, so the classifier inherits its operational cosine cuts from the whole-sample Firm A and all-pairs distributions.
Section IV-F.3 reports a sensitivity check confirming that replacing $0.95$ with the slightly stricter Firm A P5 percentile $0.941$ alters aggregate firm-level capture rates by at most $\approx 1.2$ percentage points, so the round-number heuristic is robust to nearby percentile-based alternatives. Section IV-F.3 reports a sensitivity check confirming that replacing $0.95$ with the nearby rounded sensitivity cut $0.945$ (motivated by the calibration-fold P5 = 0.9407, see Section IV-F.2) shifts whole-Firm-A dual-rule capture by 1.19 percentage points, so the round-number heuristic is robust to nearby percentile-based alternatives.
Section IV-F.2 reports both calibration-fold and held-out-fold capture rates for this classifier so that fold-level sampling variance is visible.
Second, the dHash cutoffs $\leq 5$ and $> 15$ are chosen from the whole-sample Firm A $\text{dHash}_\text{indep}$ distribution: $\leq 5$ captures the upper tail of the high-similarity mode (whole-sample Firm A median $\text{dHash}_\text{indep} = 2$, P75 $\approx 4$, so $\leq 5$ is the band immediately above median), while $> 15$ marks the regime in which independent-minimum structural similarity is no longer indicative of image reproduction.
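For concreteness, the $\text{dHash}$ descriptor and its Hamming distance can be sketched in a few lines of numpy. This is a minimal illustration on a synthetic grayscale array with nearest-neighbor subsampling; the production pipeline's image library, preprocessing, and interpolation may differ.

```python
import numpy as np

def dhash(gray: np.ndarray, hash_size: int = 8) -> np.ndarray:
    """Difference hash: subsample to (hash_size, hash_size + 1) with
    nearest-neighbor indexing, then compare horizontally adjacent
    pixels, yielding hash_size * hash_size bits."""
    h, w = gray.shape
    rows = np.arange(hash_size) * h // hash_size
    cols = np.arange(hash_size + 1) * w // (hash_size + 1)
    small = gray[np.ix_(rows, cols)]
    return (small[:, 1:] > small[:, :-1]).ravel()

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two dHashes (0..64 for 8x8)."""
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(120, 300))          # synthetic "signature crop"
identical = img.copy()                               # byte-identical reproduction
noisy = np.clip(img + rng.integers(-5, 6, img.shape), 0, 255)  # rescan-like noise

print(hamming(dhash(img), dhash(identical)))   # 0: image-reproduction regime
print(hamming(dhash(img), dhash(noisy)))       # small for near-duplicates
```

Byte-identical reproductions land at distance 0, light re-encoding noise stays in the low band, and independently produced images drift toward large distances, which is the intuition behind the $\leq 5$ and $> 15$ cuts.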
Third, the signature-level threshold-estimator outputs of Section IV-D (KDE antimode, Beta-mixture and logit-Gaussian crossings, BD/McCrary diagnostic) are *not* the operational thresholds of this classifier: they are descriptive characterisation of the per-signature similarity distribution, and Section IV-D shows they do not converge to a clean two-mechanism boundary at the per-signature level---which is why the operational cosine cut is anchored on the whole-sample Firm A percentile rather than on any mixture-fit crossing.
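The percentile-anchored choice of the cosine cut, and the nearby-cut sensitivity check, can be sketched as follows. The data here are synthetic stand-ins for the whole-sample Firm A per-signature best-match cosine distribution, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in: a high-similarity mode with a long left tail.
# Illustration only -- NOT the paper's distribution.
cosines = np.clip(1.0 - rng.gamma(shape=1.5, scale=0.02, size=50_000), 0.0, 1.0)

# Lower-tail percentile anchors of the kind discussed above.
p_7_5 = np.percentile(cosines, 7.5)   # exceeded by ~92.5% of signatures
p_5_0 = np.percentile(cosines, 5.0)   # the stricter P5-style anchor

def capture_rate(x: np.ndarray, cut: float) -> float:
    """Share of signatures strictly above the cosine cut."""
    return float((x > cut).mean())

# Sensitivity of aggregate capture to nearby cuts (cf. the 0.95 vs 0.945
# check): the shift is small when the cut sits in a sparse region.
shift_pp = 100 * abs(capture_rate(cosines, 0.95) - capture_rate(cosines, 0.945))
print(p_7_5, p_5_0, shift_pp)
```

By construction, the capture rate at the P7.5 anchor is approximately 92.5%, which is the "transparent interpretation" the round-number cut inherits.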
@@ -73,9 +73,9 @@ The $N = 168{,}740$ count used in Table V and in the downstream same-CPA per-sig
| All-CPA min dHash (independent) | 168,740 | 0.0468 | <0.001 | Multimodal |
-->
Firm A's per-signature cosine distribution *fails to reject unimodality* ($p = 0.17$), a pattern consistent with a dominant high-similarity regime plus a long left tail attributable to within-firm heterogeneity in signing outputs (Section III-G discusses the scope of partner-level claims).
The all-CPA cosine distribution, which mixes many firms with heterogeneous signing practices, is *multimodal* ($p < 0.001$).
The Firm A unimodal-long-tail finding is, in conjunction with the byte-identity, partner-ranking, and intra-report evidence reported below, consistent with the replication-dominated framing (Section III-H): a dominant high-similarity regime plus residual within-firm heterogeneity, rather than two cleanly separated mechanisms.
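Hartigan's dip test needs a dedicated implementation, but the unimodal-vs-multimodal distinction it formalizes can be conveyed with a crude, package-free stand-in that counts KDE modes. Synthetic data and an assumed helper; this carries no p-value and is illustration only.

```python
import numpy as np
from scipy.stats import gaussian_kde

def n_kde_modes(x: np.ndarray, bw: float = 0.5, grid_size: int = 512) -> int:
    """Count local maxima of a Gaussian KDE -- a crude stand-in for the
    dip test's unimodal-vs-multimodal distinction (no p-value)."""
    grid = np.linspace(x.min(), x.max(), grid_size)
    dens = gaussian_kde(x, bw_method=bw)(grid)
    interior = dens[1:-1]
    return int(((interior > dens[:-2]) & (interior > dens[2:])).sum())

rng = np.random.default_rng(2)
# One dominant high-similarity regime with noise: unimodal.
unimodal_tail = rng.normal(0.97, 0.01, 5_000)
# Two cleanly separated mechanisms: multimodal.
mixture = np.concatenate([rng.normal(0.60, 0.03, 3_000),
                          rng.normal(0.97, 0.01, 3_000)])
print(n_kde_modes(unimodal_tail), n_kde_modes(mixture))
```

A long left tail widens the single mode without creating a second one, which is why "fails to reject unimodality" is the appropriately hedged reading of the Firm A result.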
### 2) Burgstahler-Dichev / McCrary Density-Smoothness Diagnostic
@@ -204,7 +204,7 @@ Under this proper test the two extreme rules agree across folds (cosine $> 0.837
The operationally relevant rules in the 85-95% capture band differ between folds by 1-5 percentage points ($p < 0.001$ given the $n \approx 45\text{k}/15\text{k}$ fold sizes).
Both folds nevertheless sit in the same replication-dominated regime: every calibration-fold rate in the 85-99% range has a held-out counterpart in the 87-99% range, and the operational dual rule cosine $> 0.95$ AND $\text{dHash}_\text{indep} \leq 8$ captures 89.40% of the calibration fold and 91.54% of the held-out fold.
The modest fold gap is consistent with within-Firm-A heterogeneity in replication intensity: the random 30% CPA sample evidently contained proportionally more high-replication CPAs.
We therefore interpret the held-out fold as confirming the qualitative finding (Firm A is strongly replication-dominated across both folds) while cautioning that exact rates carry fold-level sampling noise that a single 30% split cannot eliminate; the threshold-independent partner-ranking analysis (Section IV-G.2) is the cross-check that is robust to this fold variance.
### 3) Operational-Threshold Sensitivity: cos $> 0.95$ vs cos $> 0.945$
@@ -332,7 +332,7 @@ A report is "in agreement" if both signature labels fall in the same coarse buck
Firm A achieves 89.9% intra-report agreement, with 87.5% of Firm A reports having *both* signers classified as non-hand-signed and only 4 reports (0.01%) having both classified as likely hand-signed.
The other Big-4 firms (B, C, D) and non-Big-4 firms cluster at 62-67% agreement, a 23-28 percentage-point gap.
This 23-28 percentage-point gap in intra-report agreement between Firm A and the other firms is consistent with firm-wide (rather than partner-specific) non-hand-signing practice; we do not claim a sharp discontinuity in the formal sense, since classifier calibration, firm-specific document-production pipelines, and signer-mix differences could each contribute to gap magnitude.
We note that this test uses the calibrated classifier of Section III-K rather than a threshold-free statistic; the substantive evidence lies in the *cross-firm gap* between Firm A and the other firms rather than in the absolute agreement rate at any single firm, and that gap is robust to moderate shifts in the absolute cutoff so long as the cutoff is applied uniformly across firms.
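The intra-report agreement statistic itself is simple to compute. The sketch below uses hypothetical column names and toy labels; in the paper the per-signature labels come from the calibrated classifier of Section III-K.

```python
import pandas as pd

# Hypothetical per-signature verdicts on two-signer reports.
sigs = pd.DataFrame({
    "report_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "firm":      ["A", "A", "A", "A", "B", "B", "B", "B"],
    "label":     ["non_hand", "non_hand", "non_hand", "non_hand",
                  "hand", "hand", "non_hand", "hand"],
})

def intra_report_agreement(df: pd.DataFrame) -> pd.Series:
    """Share of two-signer reports whose co-signers carry the same
    coarse label, by firm."""
    same = df.groupby(["firm", "report_id"])["label"].nunique() == 1
    return same.groupby("firm").mean()

agree = intra_report_agreement(sigs)
print(agree)  # firm A: 1.0, firm B: 0.5 in this toy example
```

The cross-firm comparison then rests on the gap between the per-firm agreement rates, applied with a uniform cutoff, exactly as the paragraph above describes.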
@@ -341,7 +341,7 @@ We note that this test uses the calibrated classifier of Section III-K rather th
Table XVII presents the final classification results under the dual-descriptor framework with Firm A-calibrated thresholds for 84,386 documents.
The document count (84,386) differs from the 85,042 documents with any YOLO detection (Table III) because 656 documents carry only a single detected signature, for which no same-CPA pairwise comparison and therefore no best-match cosine / min dHash statistic is available; those documents are excluded from the classification reported here.
We emphasize that the document-level proportions below reflect the *worst-case aggregation rule* of Section III-K: a report carrying one stamped signature and one hand-signed signature is labeled with the most-replication-consistent of the two signature-level verdicts.
Document-level rates therefore represent the share of reports in which *at least one* signature is non-hand-signed rather than the share in which *both* are; the intra-report agreement analysis of Section IV-G.3 (Table XVI) reports how frequently the two co-signers share the same signature-level label within each firm, so that readers can judge what fraction of the non-hand-signed document-level share corresponds to fully non-hand-signed reports versus mixed reports.
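The worst-case aggregation rule reduces to a maximum over an ordered label set. A minimal sketch with illustrative label names (Section III-K defines the actual coarse buckets):

```python
# Illustrative label names and ordering; higher severity means
# "more replication-consistent".
SEVERITY = {"likely_hand_signed": 0, "indeterminate": 1, "non_hand_signed": 2}

def document_label(signature_labels: list) -> str:
    """Worst-case aggregation: the report inherits the label of its
    most replication-consistent signature."""
    return max(signature_labels, key=SEVERITY.__getitem__)

# A mixed report (one hand-signed, one stamped) is labeled non-hand-signed.
print(document_label(["likely_hand_signed", "non_hand_signed"]))  # non_hand_signed
print(document_label(["likely_hand_signed", "indeterminate"]))    # indeterminate
```

This makes concrete why document-level rates are "at least one signature" shares, and why Table XVI is needed to separate fully non-hand-signed reports from mixed ones.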
<!-- TABLE XVII: Document-Level Classification (Dual-Descriptor: Cosine + dHash)
| Verdict | N (PDFs) | % | Firm A | Firm A % |
@@ -370,8 +370,9 @@ We note that because the non-hand-signed thresholds are themselves calibrated to
### 2) Cross-Firm Comparison of Dual-Descriptor Convergence
Among the 65,515 non-Firm-A signatures with per-signature best-match cosine $> 0.95$, 42.12% have $\text{dHash}_\text{indep} \leq 5$, compared to 88.32% of the 55,921 Firm A signatures meeting the same cosine condition---a $\sim 2.1\times$ difference that the structural-verification layer makes visible.
This cross-firm gap is consistent with firm-wide non-hand-signing practice at Firm A versus partner-specific or per-engagement replication at other firms; it complements the partner-level ranking (Section IV-G.2) and intra-report consistency (Section IV-G.3) findings.
Counts and percentages are reproduced by `signature_analysis/28_byte_identity_decomposition.py` and reported in `reports/byte_identity_decomp/byte_identity_decomposition.json` (see Appendix B for the table-to-script provenance map).
## I. Ablation Study: Feature Backbone Comparison
@@ -8,39 +8,40 @@ occurring reference populations instead of manual labels:
Positive anchor 1: pixel_identical_to_closest = 1
    Two signature images byte-identical after crop/resize.
    Mathematically impossible to arise from independent hand-signing
    => pair-level proof of image reuse and a CONSERVATIVE-SUBSET
       ground truth for non-hand-signing (only those whose nearest
       same-CPA match happens to be byte-identical).
Positive anchor 2: Firm A signatures
    Treated in the manuscript as a REPLICATION-DOMINATED population
    based on the paper's own image evidence: the byte-level pair
    analysis, the Firm A per-signature similarity distribution, the
    partner-ranking concentration, and the intra-report consistency
    gap. Approximately 7% of Firm A signatures fall below cosine
    0.95, forming the long left tail observed in the dip test
    (Script 15).
Negative anchor: signatures with cosine <= low threshold
    Pairs with very low cosine similarity cannot plausibly be pixel
    duplicates, so they serve as a conservative supplementary
    negative reference.
Metrics computed (legacy; NOT all reported in the manuscript):
    - FAR against the inter-CPA negative anchor is the primary metric
      reported (Table X). The byte-identical positive anchor has cosine
      ~= 1 by construction, so FRR / EER / Precision / F1 against that
      subset are arithmetic tautologies (FRR is trivially 0 below
      threshold 1) and are intentionally OMITTED from Table X. Legacy
      EER/FRR/precision/F1 helper functions remain in this script for
      diagnostic use only and their outputs are NOT cited as biometric
      performance in the paper.
    - Convergence with Firm A anchor (what fraction of Firm A signatures
      are correctly classified at each threshold).
Output:
    reports/pixel_validation/pixel_validation_report.md
    reports/pixel_validation/pixel_validation_results.json
    reports/pixel_validation/roc_cosine.png, roc_dhash.png
"""
import sqlite3
@@ -2,26 +2,39 @@
"""
Script 21: Expanded Validation with Larger Negative Anchor + Held-out Firm A
============================================================================
Addresses three weaknesses of Script 19's pixel-identity validation:
(a) Negative anchor of n=35 (cosine<0.70) is too small to give
    meaningful FAR confidence intervals.
(b) Pixel-identical positive anchor is a CONSERVATIVE SUBSET of the
    true non-hand-signed class, not representative of the broader
    positive class. Recall against this subset is therefore a
    lower-bound calibration check, not a generalizable recall
    estimate.
(c) Firm A is both the calibration anchor and a validation anchor
    (circular). The 70/30 fold split makes within-Firm-A sampling
    variance visible without claiming external validation.
This script:
1. Constructs a large inter-CPA negative anchor (~50,000 pairs) by
   randomly sampling pairs from different CPAs. Inter-CPA high
   similarity is highly unlikely to arise from legitimate signing.
2. Splits Firm A CPAs 70/30 into CALIBRATION and HELDOUT folds.
   Re-derives signature-level thresholds from the calibration fold
   only, then reports capture rates on the heldout fold.
3. Computes 95% Wilson confidence intervals for FAR at canonical
   thresholds (Table X in the manuscript).

Legacy / diagnostic-only metrics:
    Helper functions for EER, Precision, Recall, F1, and FRR remain in
    this script for backward compatibility. The manuscript intentionally
    OMITS these metrics from Table X because the byte-identical positive
    anchor has cosine ~= 1 by construction (so FRR / EER are arithmetic
    tautologies) and because positive and negative anchors are
    constructed from different sampling units, making prevalence
    arbitrary (so Precision and F1 have no meaningful population
    interpretation). Only FAR against the large inter-CPA negative
    anchor is reported as a biometric metric in the paper.
Output:
    reports/expanded_validation/expanded_validation_report.md
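The 95% Wilson interval referenced above is the standard score interval; the script's own helper may differ in detail. A self-contained sketch, with a hypothetical false-accept count for illustration:

```python
import math

def wilson_interval(k: int, n: int, z: float = 1.959964) -> tuple:
    """95% Wilson score interval for a binomial proportion k/n; better
    behaved than the normal approximation when k is near 0, as FAR is
    on a large inter-CPA negative anchor."""
    if n == 0:
        return (0.0, 1.0)
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n))
    return (max(0.0, center - half), min(1.0, center + half))

# Hypothetical count: 3 false accepts among ~50,000 inter-CPA negatives.
lo, hi = wilson_interval(3, 50_000)
print(f"FAR = {3 / 50_000:.5%}, 95% Wilson CI [{lo:.5%}, {hi:.5%}]")
```

Unlike the Wald interval, the Wilson lower bound stays non-negative and the interval remains informative when the observed count is zero, which is exactly the regime a large negative anchor puts FAR into.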
@@ -0,0 +1,204 @@
#!/usr/bin/env python3
"""
Script 28: Byte-Identity Decomposition + Cross-Firm Dual-Descriptor Convergence
================================================================================
Produces two reproducible artifacts cited in the manuscript that previously
lacked dedicated provenance (codex review v3.18.1 items #7 and #8):
(#7) Byte-identical Firm A signature decomposition:
- Total Firm A signatures with pixel_identical_to_closest = 1
- Number of distinct Firm A partners they span
- Number of partners in the registry (denominator)
- Number of byte-identical pairs that span DIFFERENT fiscal years
(#8) Cross-firm dual-descriptor convergence:
- Among signatures with cosine > 0.95 (per-signature best-match),
the fraction with min_dhash_independent <= 5, broken out by
Firm A vs Non-Firm-A.
Output:
/Volumes/NV2/PDF-Processing/signature-analysis/reports/byte_identity_decomp/
byte_identity_decomposition.json
byte_identity_decomposition.md
These figures are intended to be cited from the paper (Section IV-F.1 for #7;
Section IV-H.2 for #8) so that every quantitative claim in the manuscript
traces to a specific JSON field.
"""
import json
import sqlite3
from datetime import datetime
from pathlib import Path
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
'byte_identity_decomp')
OUT.mkdir(parents=True, exist_ok=True)
FIRM_A = '勤業眾信聯合'
def byte_identity_decomposition(conn):
"""Codex item #7: 145 / 50 / 180 / 35 decomposition."""
cur = conn.cursor()
cur.execute("""
SELECT COUNT(DISTINCT name)
FROM accountants
WHERE firm = ?
""", (FIRM_A,))
n_registered_partners = cur.fetchone()[0]
cur.execute("""
WITH byte_pairs AS (
SELECT s1.signature_id AS sig_a,
s1.assigned_accountant AS partner,
s1.year_month AS ym_a,
s2.year_month AS ym_b
FROM signatures s1
JOIN signatures s2 ON s1.closest_match_file = s2.image_filename
WHERE s1.pixel_identical_to_closest = 1
AND s1.excel_firm = ?
)
SELECT
COUNT(*) AS total_pixel_identical_firm_a,
COUNT(DISTINCT partner) AS partners_with_pixel_identical,
SUM(CASE WHEN substr(ym_a,1,4) <> substr(ym_b,1,4) THEN 1 ELSE 0 END)
AS cross_year_pairs
FROM byte_pairs
""", (FIRM_A,))
n_total, n_partners, n_cross_year = cur.fetchone()
return {
'definition': (
'Among Firm A signatures whose nearest same-CPA match is '
'byte-identical after crop and normalization '
'(pixel_identical_to_closest = 1), this section reports the '
'count, the distinct-partner spread, the registry denominator, '
'and the subset whose byte-identical match is in a different '
'fiscal year.'
),
'firm_label': 'Firm A',
'n_pixel_identical_firm_a_signatures': n_total,
'n_distinct_partners_with_pixel_identical': n_partners,
'n_registered_partners_in_firm_a': n_registered_partners,
'partner_coverage_share': round(n_partners / n_registered_partners, 4),
'n_cross_year_byte_identical_pairs': n_cross_year,
}
def cross_firm_dual_convergence(conn):
"""Codex item #8: per-signature dual-descriptor convergence by firm."""
cur = conn.cursor()
cur.execute("""
SELECT
CASE WHEN excel_firm = ? THEN 'Firm A' ELSE 'Non-Firm-A' END
AS firm_group,
COUNT(*) AS n_signatures_above_095,
SUM(CASE WHEN min_dhash_independent <= 5 THEN 1 ELSE 0 END)
AS n_dhash_le_5
FROM signatures
WHERE max_similarity_to_same_accountant > 0.95
AND assigned_accountant IS NOT NULL
AND min_dhash_independent IS NOT NULL
GROUP BY firm_group
ORDER BY firm_group
""", (FIRM_A,))
rows = cur.fetchall()
by_group = {}
for firm_group, n_above, n_dhash in rows:
by_group[firm_group] = {
'n_signatures_above_cosine_095': n_above,
'n_dhash_indep_le_5': n_dhash,
'pct_dhash_indep_le_5': round(100.0 * n_dhash / n_above, 2),
}
return {
'definition': (
'Per-signature best-match cosine > 0.95 AND assigned_accountant '
'IS NOT NULL AND min_dhash_independent IS NOT NULL. The reported '
'percentage is the share of these signatures whose independent '
'min dHash to any same-CPA signature is <= 5.'
),
'unit_of_observation': 'signature',
'cosine_threshold': 0.95,
'dhash_indep_threshold': 5,
'by_firm_group': by_group,
}
def write_markdown(payload, path):
bid = payload['byte_identity_decomposition']
cf = payload['cross_firm_dual_convergence']
lines = []
lines.append('# Byte-Identity Decomposition + Cross-Firm Dual-Descriptor '
'Convergence')
lines.append('')
lines.append(f"Generated at: {payload['generated_at']}")
lines.append('')
lines.append('## 1. Byte-Identity Decomposition (Firm A)')
lines.append('')
lines.append(bid['definition'])
lines.append('')
lines.append('| Quantity | Value |')
lines.append('|----------|-------|')
lines.append(f"| Pixel-identical Firm A signatures | "
f"{bid['n_pixel_identical_firm_a_signatures']} |")
lines.append(f"| Distinct Firm A partners with at least one such pair | "
f"{bid['n_distinct_partners_with_pixel_identical']} |")
lines.append(f"| Registered Firm A partners | "
f"{bid['n_registered_partners_in_firm_a']} |")
lines.append(f"| Partner coverage share | "
f"{bid['partner_coverage_share']:.3f} |")
lines.append(f"| Pairs whose byte-identical match spans different fiscal "
f"years | {bid['n_cross_year_byte_identical_pairs']} |")
lines.append('')
lines.append('## 2. Cross-Firm Dual-Descriptor Convergence')
lines.append('')
lines.append(cf['definition'])
lines.append('')
lines.append('| Firm group | N signatures with cosine > 0.95 | '
'N with dHash_indep <= 5 | % with dHash_indep <= 5 |')
lines.append('|------------|--------------------------------:|'
'------------------------:|------------------------:|')
for grp in ('Firm A', 'Non-Firm-A'):
g = cf['by_firm_group'][grp]
lines.append(f"| {grp} | "
f"{g['n_signatures_above_cosine_095']:,} | "
f"{g['n_dhash_indep_le_5']:,} | "
f"{g['pct_dhash_indep_le_5']:.2f}% |")
path.write_text('\n'.join(lines) + '\n', encoding='utf-8')
def main():
conn = sqlite3.connect(DB)
try:
payload = {
'generated_at': datetime.now().isoformat(timespec='seconds'),
'database_path': DB,
'firm_a_label': FIRM_A,
'byte_identity_decomposition': byte_identity_decomposition(conn),
'cross_firm_dual_convergence': cross_firm_dual_convergence(conn),
}
finally:
conn.close()
json_path = OUT / 'byte_identity_decomposition.json'
json_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False),
encoding='utf-8')
print(f'Wrote {json_path}')
md_path = OUT / 'byte_identity_decomposition.md'
write_markdown(payload, md_path)
print(f'Wrote {md_path}')
if __name__ == '__main__':
main()