Paper A v3.20.0: partner Jimmy 2026-04-27 review + DOCX rendering overhaul

Substantive content (addresses partner Jimmy's 2026-04-27 review of v3.19.1):

Must-fix items (6/6):
- §III-F SSIM/pixel rejection rewritten from first principles (design-level
  argument from luminance/contrast/structure local-window product, not the
  prior empirical 0.70 result)
- Table VI restructured by population × method; added missing Firm A
  logit-Gaussian-2 0.999 row; KDE marked undefined (unimodal), BD/McCrary
  marked bin-unstable (Appendix A)
- Tables IX / XI / §IV-F.3 dHash 5/8/15 inconsistency resolved: ≤8 demoted
  from "operational dual" to "calibration-fold-adjacent reference"; the
  actual classifier rule, cos > 0.95 AND dH ≤ 15 (92.46%), added throughout
- New Fig. 4 (yearly per-firm best-match cosine, 5 lines, 2013-2023, Firm A
  on top); script 30_yearly_big4_comparison.py
- Tables XIV / XV extended with top-20% (94.8%) and top-30% (81.3%) brackets
- §III-K reframed P7.5 from "round-number lower-tail boundary" to operating
  point; new Table XII-B (cosine-FAR-capture tradeoff at 5 thresholds:
  0.9407 / 0.945 / 0.95 / 0.977 / 0.985)
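
Illustrative sketch of the conjunctive classifier rule named above
(cos > 0.95 AND dH ≤ 15). This is not the repository's code: the function
names and the integer encoding of the dHash are assumptions made for the
example; only the two cut values come from this message.

```python
COS_THRESHOLD = 0.95    # cosine cut from the rule above
DHASH_THRESHOLD = 15    # dHash Hamming-distance cut from the rule above


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two integer-encoded hashes."""
    return bin(hash_a ^ hash_b).count("1")


def is_flagged(cosine: float, dhash_distance: int) -> bool:
    """Apply the conjunctive rule: cos > 0.95 AND dH <= 15."""
    return cosine > COS_THRESHOLD and dhash_distance <= DHASH_THRESHOLD
```

Both conditions must hold, so a pair with a very high cosine but a dHash
distance of 16 is not flagged, and neither is a pair sitting exactly at
cosine 0.95.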

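Nice-to-have items (3/3; the third, the within-year ranking robustness
check, was subsequently removed, see "Within-year analyses removed" below):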
- Table XII expanded to 6-cut classifier sensitivity grid (0.940-0.985)
- Defensive parentheticals (84,386 vs 85,042; 30,226 vs 30,222) moved to
  table notes; cut "invite reviewer skepticism" and "non-load-bearing"

Codex 3-pass verification cleanup:
- Stale 0.973/0.977/0.979 references unified on canonical 0.977 (Firm A
  Beta-2 forced-fit crossing from beta_mixture_results.json)
- dHash≤8 wording corrected to P95-adjacent (P95 = 9, ≤8 is the integer
  immediately below) instead of misleading "rounded down"
- Table XII-B prose corrected: per-segment qualification of "non-Firm-A
  capture falls faster" (true on 0.95→0.977 segment but contracts on
  0.977→0.985 segment); arithmetic now from exact counts

Within-year analyses removed:
- Within-year ranking robustness check (Class A) was added in nice-to-have
  pass but contradicts v3.14 A2-removal stance; removed from §IV-G.2 + the
  Appendix B provenance row
- Within-CPA future-work disclosures (Class B) removed from Discussion
  limitation #5 and Conclusion future-work paragraph; subsequent limitations
  renumbered Sixth → Fifth, Seventh → Sixth

DOCX rendering pipeline overhaul (paper/export_v3.py):

Critical fix - every v3 DOCX since v3.0 was shipping WITHOUT TABLES:
strip_comments() was wholesale-deleting HTML comments, but every numerical
table is wrapped in <!-- TABLE X: ... -->, so the table body was deleted
alongside the wrapper. It now unwraps TABLE comments (emitting a synthetic
__TABLE_CAPTION__: marker plus the table body) while still stripping
non-TABLE editorial comments. Result: 19 tables now render in the DOCX.
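
Minimal sketch of the unwrap-versus-strip behaviour described above. It is
not the code in paper/export_v3.py: the regexes, the assumption that the
caption sits on the comment's first line with the table body on the
following lines, and the exact text after the __TABLE_CAPTION__: marker are
all illustrative guesses.

```python
import re

# Matches any HTML comment, including multi-line ones.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)
# Assumed layout: "TABLE <label>: <caption>" on line one, table body after it.
TABLE_RE = re.compile(r"\s*TABLE\s+([^:\n]+):[ \t]*([^\n]*)\n?(.*)", re.DOTALL)


def strip_comments(markdown: str) -> str:
    """Drop editorial comments; unwrap TABLE comments into marker + body."""
    def replace(match: re.Match) -> str:
        table = TABLE_RE.fullmatch(match.group(1))
        if table is None:
            return ""  # editorial comment: delete it entirely
        label, caption, body = table.groups()
        marker = f"__TABLE_CAPTION__: TABLE {label.strip()}: {caption.strip()}"
        return f"{marker}\n{body.strip()}\n"
    return COMMENT_RE.sub(replace, markdown)
```

The key design point is that the decision is made per comment: only a
comment whose content starts with TABLE keeps its body.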

Other rendering fixes:
- LaTeX → Unicode conversion (50+ token replacements: Greek alphabet, ≤≥,
  ×·≈, →↔⇒, etc.); \frac/\sqrt linearisation; TeX brace tricks ({=}, {,})
- Math-context-scoped sub/superscript via paired PUA (Private Use Area)
  sentinels: no more underscore-eating in identifiers like signature_analysis
- Display equations rendered via matplotlib mathtext to PNG (3 equations:
  cosine sim, mixture crossing, BD/McCrary Z statistic), embedded as
  numbered equation blocks (1), (2), (3); content-addressed cache at
  paper/equations/ (gitignored, regenerable)
- Manual numbered/bulleted list rendering with hanging indent (replaces
  python-docx style="List Number" which silently drops the number prefix
  when no numbering definition is bound)
- Markdown blockquote (> ...) defensively stripped
- Pandoc footnote ([^name]) markers no longer leak (inlined at source)
- Heading text cleaned of LaTeX residue + PUA sentinels
- File paths in body text (signature_analysis/X.py, reports/Y.json)
  trimmed to "(reproduction artifact in Appendix B)" pointers
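
Illustrative sketch of the math-context-scoped sub/superscript idea from the
list above. It is a simplified stand-in, not the export_v3.py implementation:
the specific Private Use Area code points, the inline $...$ delimiter
handling, and the function name are assumptions.

```python
import re

# Hypothetical Private Use Area code points used as in-band markers.
SUB_OPEN, SUB_CLOSE = "\ue000", "\ue001"
SUP_OPEN, SUP_CLOSE = "\ue002", "\ue003"

MATH_RE = re.compile(r"\$([^$\n]+)\$")  # inline $...$ spans only
SCRIPT_RE = re.compile(r"([_^])(?:\{([^{}]*)\}|([A-Za-z0-9]))")


def mark_scripts(text: str) -> str:
    """Tag sub/superscripts with sentinels, but only inside $...$ math."""
    def in_math(match):
        def tag(m):
            body = m.group(2) if m.group(2) is not None else m.group(3)
            if m.group(1) == "_":
                return SUB_OPEN + body + SUB_CLOSE
            return SUP_OPEN + body + SUP_CLOSE
        return SCRIPT_RE.sub(tag, match.group(1))
    return MATH_RE.sub(in_math, text)
```

Because the substitution only runs on the captured math span, an underscore
in ordinary prose such as signature_analysis is never touched; a later
rendering step can turn each sentinel pair into a real sub/superscript run.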

New leak linter: paper/lint_paper_v3.py - two-pass markdown source +
rendered DOCX leak detector; auto-runs at end of export_v3.py.
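
Rough sketch of what one pass of such a leak detector could look like. The
pattern set and the function name are hypothetical, chosen to mirror the
leak classes listed in this message; they are not taken from
paper/lint_paper_v3.py. The input is a list of paragraph strings, which
could come from the markdown source or from text extracted from the DOCX.

```python
import re

# Hypothetical leak patterns, mirroring the fixes listed in this message.
LEAK_PATTERNS = {
    "html_comment": re.compile(r"<!--|-->"),
    "latex_command": re.compile(r"\\[A-Za-z]+"),
    "pandoc_footnote": re.compile(r"\[\^[^\]]+\]"),
    "pua_sentinel": re.compile(r"[\ue000-\uf8ff]"),
    "table_marker": re.compile(r"__TABLE_CAPTION__"),
}


def find_leaks(paragraphs):
    """Return (paragraph_index, pattern_name, matched_text) for every hit."""
    leaks = []
    for index, text in enumerate(paragraphs):
        for name, pattern in LEAK_PATTERNS.items():
            for match in pattern.finditer(text):
                leaks.append((index, name, match.group(0)))
    return leaks
```

An empty result means no known leak pattern was found; it does not prove the
rendering is clean, only that none of the listed patterns matched.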

Script changes:
- 21_expanded_validation.py: added 0.9407, 0.977, 0.985 to canonical FAR
  threshold list so Table XII-B is reproducible from persisted JSON
- 30_yearly_big4_comparison.py: NEW; generates Fig. 4 + per-firm yearly
  data (writes to reports/figures/ and reports/firm_yearly_comparison/)
- 31_within_year_ranking_robustness.py: NEW; supports the within-year
  robustness check (no longer cited in paper but kept as repo-internal
  due-diligence artifact)
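
Sketch of how a per-threshold FAR table with exact counts could be persisted
to JSON, in the spirit of the 21_expanded_validation.py change above. The
function names, the JSON field names, and the working definition of FAR used
here (share of negative-reference pair cosines strictly above the cut) are
assumptions for illustration; only the five threshold values come from this
message.

```python
import json

FAR_THRESHOLDS = [0.9407, 0.945, 0.95, 0.977, 0.985]  # Table XII-B cuts


def far_table(negative_cosines, thresholds=FAR_THRESHOLDS):
    """Share of negative-pair cosines strictly above each threshold."""
    total = len(negative_cosines)
    if total == 0:
        raise ValueError("need at least one negative pair")
    rows = []
    for threshold in thresholds:
        hits = sum(1 for value in negative_cosines if value > threshold)
        rows.append({
            "threshold": threshold,
            "n_above": hits,      # exact count, kept alongside the rate
            "n_total": total,
            "far": hits / total,
        })
    return rows


def to_json(rows) -> str:
    """Serialise the table so downstream prose can be rebuilt from counts."""
    return json.dumps(rows, indent=2)
```

Storing the counts next to the rates lets any quoted percentage be
recomputed from the persisted file rather than from rounded values.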

Partner handoff DOCX shipped to
~/Downloads/Paper_A_IEEE_Access_Draft_v3.20.0_20260505.docx (536 KB:
19 tables + 4 figures + 3 equation images).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 13:44:49 +08:00
parent 623eb4cd4b
commit 53125d11d9
13 changed files with 1554 additions and 112 deletions
+4 -7
@@ -61,7 +61,7 @@ The dual-descriptor framework correctly identifies these cases as distinct from
The use of Firm A as a calibration reference addresses a fundamental challenge in document forensics: the scarcity of ground truth labels.
In most forensic applications, establishing ground truth requires expensive manual verification or access to privileged information about document provenance.
-Our approach leverages domain knowledge---the established prevalence of non-hand-signing at a specific firm---to create a naturally occurring reference population within the dataset.
+Our approach uses practitioner background---one Big-4 firm reportedly relies predominantly on stamping or e-signing workflows---only as a *motivation* for selecting that firm as a candidate reference population; the calibration role is then established from the audit-report images themselves (byte-identical same-CPA pairs, the Firm A per-signature similarity distribution, partner-ranking concentration, and intra-report consistency), so the calibration does not depend on the practitioner-background claim being externally verified (Section III-H).
This calibration strategy has broader applicability beyond signature analysis.
Any forensic detection system operating on real-world corpora can benefit from identifying subpopulations with known dominant characteristics (positive or negative) to anchor threshold selection, particularly when the distributions of interest are non-normal and non-parametric or mixture-based thresholds are preferred over parametric alternatives.
@@ -97,15 +97,12 @@ This effect would bias classification toward false negatives rather than false p
Fourth, scanning equipment, PDF generation software, and compression algorithms may have changed over the 10-year study period (2013--2023), potentially affecting similarity measurements.
While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded.
-Fifth, our cross-sectional analysis does not track individual CPAs longitudinally and therefore cannot confirm or rule out within-CPA mechanism transitions over the sample period (e.g., a CPA who hand-signed early in the sample and switched to firm-level e-signing later, or vice versa).
-Extending the analysis to *auditor-year* units---computing per-signature statistics within each fiscal year and observing how individual CPAs move across years---is the natural next step for resolving such within-CPA transitions and is left to future work.
-Sixth, the max/min detection logic treats both ends of a near-identical same-CPA pair as non-hand-signed.
+Fifth, the max/min detection logic treats both ends of a near-identical same-CPA pair as non-hand-signed.
In the rare case that one of the two documents contains a genuinely hand-signed exemplar that was subsequently reused as the stamping or e-signature template, the pair correctly identifies image reuse but misattributes the non-hand-signed status to the source exemplar.
This misattribution affects at most one source document per template variant per CPA (the exemplar from which the template was produced), is not expected to be common given that stored signature templates are typically generated in a separate acquisition step rather than extracted from submitted audit reports, and does not materially affect aggregate capture rates at the firm level.
-Seventh, our analyses remain at the signature level; we abstain from partner-level frequency inferences such as "X% of CPAs hand-sign in a given year."
-Per-signature labels in this paper are not translated to per-report or per-partner mechanism assignments, because making such a translation would require an assumption of within-year uniformity of signing mechanisms that we do not adopt: a CPA's signatures within a single fiscal year may reflect a single replication template, multiple templates used in parallel (e.g., for different engagement positions or reporting pipelines), within-year mechanism mixing, or a combination, and the data at hand do not disambiguate these possibilities (Section III-G).
+Sixth, our analyses remain at the signature level; we abstain from partner-level frequency inferences such as "X% of CPAs hand-sign in a given year."
+Per-signature labels in this paper are not translated to per-report or per-partner mechanism assignments (Section III-G).
The signature-level rates we report, including the 92.5% / 7.5% Firm A split and the year-by-year left-tail share of Section IV-G.1, should accordingly be read as signature-level quantities rather than partner-level frequencies.
Finally, the legal and regulatory implications of our findings depend on jurisdictional definitions of "signature" and "signing."