Files
pdf_signature_extraction/paper/export_v3.py
gbanyan 53125d11d9 Paper A v3.20.0: partner Jimmy 2026-04-27 review + DOCX rendering overhaul
Substantive content (addresses partner Jimmy's 2026-04-27 review of v3.19.1):

Must-fix items (6/6):
- §III-F SSIM/pixel rejection rewritten from first principles (design-level
  argument from luminance/contrast/structure local-window product, not the
  prior empirical 0.70 result)
- Table VI restructured by population × method; added missing Firm A
  logit-Gaussian-2 0.999 row; KDE marked undefined (unimodal), BD/McCrary
  marked bin-unstable (Appendix A)
- Tables IX / XI / §IV-F.3 dHash 5/8/15 inconsistency resolved: ≤8 demoted
  from "operational dual" to "calibration-fold-adjacent reference"; the
  actual classifier rule cos>0.95 AND dH≤15 = 92.46% added throughout
- New Fig. 4 (yearly per-firm best-match cosine, 5 lines, 2013-2023, Firm A
  on top); script 30_yearly_big4_comparison.py
- Tables XIV / XV extended with top-20% (94.8%) and top-30% (81.3%) brackets
- §III-K reframed P7.5 from "round-number lower-tail boundary" to operating
  point; new Table XII-B (cosine-FAR-capture tradeoff at 5 thresholds:
  0.9407 / 0.945 / 0.95 / 0.977 / 0.985)

Nice-to-have items (3/3):
- Table XII expanded to 6-cut classifier sensitivity grid (0.940-0.985)
- Defensive parentheticals (84,386 vs 85,042; 30,226 vs 30,222) moved to
  table notes; cut "invite reviewer skepticism" and "non-load-bearing"

Codex 3-pass verification cleanup:
- Stale 0.973/0.977/0.979 references unified on canonical 0.977 (Firm A
  Beta-2 forced-fit crossing from beta_mixture_results.json)
- dHash≤8 wording corrected to P95-adjacent (P95 = 9, ≤8 is the integer
  immediately below) instead of misleading "rounded down"
- Table XII-B prose corrected: per-segment qualification of "non-Firm-A
  capture falls faster" (true on 0.95→0.977 segment but contracts on
  0.977→0.985 segment); arithmetic now from exact counts

Within-year analyses removed:
- Within-year ranking robustness check (Class A) was added in nice-to-have
  pass but contradicts v3.14 A2-removal stance; removed from §IV-G.2 + the
  Appendix B provenance row
- Within-CPA future-work disclosures (Class B) removed from Discussion
  limitation #5 and Conclusion future-work paragraph; subsequent limitations
  renumbered Sixth → Fifth, Seventh → Sixth

DOCX rendering pipeline overhaul (paper/export_v3.py):

Critical fix - every v3 DOCX since v3.0 was shipping WITHOUT TABLES:
strip_comments() was wholesale-deleting HTML comments, but every numerical
table is wrapped in <!-- TABLE X: ... -->, so the table body was deleted
alongside the wrapper. Now unwraps TABLE comments (emit synthetic
__TABLE_CAPTION__: marker + table body) while still stripping non-TABLE
editorial comments. Result: 19 tables now render in the DOCX.
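
The unwrap behaviour can be sanity-checked in isolation with a minimal sketch that mirrors the strip_comments() logic in export_v3.py (sample text and function name here are illustrative):

```python
import re

def unwrap_table_comments(text):
    """Unwrap <!-- TABLE ... --> comments into a synthetic caption marker
    plus the table body; delete every other HTML comment."""
    def _replace(match):
        lines = match.group(1).splitlines()
        for idx, line in enumerate(lines):
            if line.strip():
                if not line.strip().startswith("TABLE"):
                    return ""  # non-TABLE comment: strip wholesale
                caption = line.strip()
                body = "\n".join(lines[idx + 1:])
                return f"\n\n__TABLE_CAPTION__:{caption}\n{body}\n"
        return ""  # empty comment
    return re.sub(r"<!--(.*?)-->", _replace, text, flags=re.DOTALL)

sample = (
    "Intro.\n"
    "<!-- TABLE V: Hartigan Dip Test Results\n"
    "| Distribution | N |\n"
    "|--------------|---|\n"
    "-->\n"
    "<!-- editorial: tighten this paragraph -->\n"
    "Outro.\n"
)
out = unwrap_table_comments(sample)
# The caption marker and table body survive; the editorial comment is gone.
```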

Other rendering fixes:
- LaTeX → Unicode conversion (50+ token replacements: Greek alphabet, ≤≥,
  ×·≈, →↔⇒, etc.); \frac/\sqrt linearisation; TeX brace tricks ({=}, {,})
- Math-context-scoped sub/superscript via PUA sentinels: no more
  underscore-eating in identifiers like signature_analysis
- Display equations rendered via matplotlib mathtext to PNG (3 equations:
  cosine sim, mixture crossing, BD/McCrary Z statistic), embedded as
  numbered equation blocks (1), (2), (3); content-addressed cache at
  paper/equations/ (gitignored, regenerable)
- Manual numbered/bulleted list rendering with hanging indent (replaces
  python-docx style="List Number" which silently drops the number prefix
  when no numbering definition is bound)
- Markdown blockquote (> ...) defensively stripped
- Pandoc footnote ([^name]) markers no longer leak (inlined at source)
- Heading text cleaned of LaTeX residue + PUA sentinels
- File paths in body text (signature_analysis/X.py, reports/Y.json)
  trimmed to "(reproduction artifact in Appendix B)" pointers
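
The token-replacement step can be sketched as follows (a cut-down table; the shipped LATEX_TOKEN_REPLACEMENTS list in export_v3.py covers 50+ tokens and runs before \text{...} unwrapping for exactly the reason shown):

```python
import re

# Cut-down replacement table. The (?![A-Za-z]) lookahead stops \leq from
# also matching the prefix of a longer command name.
TOKENS = [
    (r"\\Delta(?![A-Za-z])", "Δ"),
    (r"\\leq(?![A-Za-z])", "≤"),
    (r"\\times(?![A-Za-z])", "×"),
    (r"\\approx(?![A-Za-z])", "≈"),
]

def replace_tokens(s):
    for pat, repl in TOKENS:
        s = re.sub(pat, repl, s)
    # Unwrap \text{...} only after glyph substitution, so \Delta\text{BIC}
    # becomes ΔBIC rather than the unsalvageable \DeltaBIC.
    return re.sub(r"\\text\{([^{}]*)\}", r"\1", s)

print(replace_tokens(r"\Delta\text{BIC} \leq 10 \approx 3 \times 3"))
# → ΔBIC ≤ 10 ≈ 3 × 3
```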

New leak linter: paper/lint_paper_v3.py - two-pass markdown source +
rendered DOCX leak detector; auto-runs at end of export_v3.py.
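
A minimal sketch of the kind of source-side checks such a linter performs (the rule names, patterns, and sample below are illustrative, not lint_paper_v3's actual rule set):

```python
import re

# Illustrative leak patterns: raw LaTeX commands, unstripped HTML comments,
# and pandoc footnote markers that should never survive into rendered text.
LEAK_PATTERNS = [
    ("latex-command", re.compile(r"\\[a-zA-Z]+\{")),
    ("html-comment", re.compile(r"<!--")),
    ("footnote-marker", re.compile(r"\[\^[A-Za-z0-9_-]+\]")),
]

def find_leaks(text):
    """Return (rule_name, line_number) pairs for every leak hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pat in LEAK_PATTERNS:
            if pat.search(line):
                hits.append((name, lineno))
    return hits

sample = "clean line\n\\frac{a}{b} leaked\nsee note[^bd-mccrary]\n"
```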

Script changes:
- 21_expanded_validation.py: added 0.9407, 0.977, 0.985 to canonical FAR
  threshold list so Table XII-B is reproducible from persisted JSON
- 30_yearly_big4_comparison.py: NEW; generates Fig. 4 + per-firm yearly
  data (writes to reports/figures/ and reports/firm_yearly_comparison/)
- 31_within_year_ranking_robustness.py: NEW; supports the within-year
  robustness check (no longer cited in paper but kept as repo-internal
  due-diligence artifact)

Partner handoff DOCX shipped to
~/Downloads/Paper_A_IEEE_Access_Draft_v3.20.0_20260505.docx (536 KB:
19 tables + 4 figures + 3 equation images).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 13:44:49 +08:00

#!/usr/bin/env python3
"""Export Paper A v3 (IEEE Access target) to Word, reading from v3 md section files."""
from docx import Document
from docx.shared import Inches, Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
from pathlib import Path
import hashlib
import re
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
PAPER_DIR = Path("/Volumes/NV2/pdf_recognize/paper")
EQUATION_CACHE_DIR = PAPER_DIR / "equations"
EQUATION_CACHE_DIR.mkdir(exist_ok=True)
FIG_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis/paper_figures")
EXTRA_FIG_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis/reports")
OUTPUT = PAPER_DIR / "Paper_A_IEEE_Access_Draft_v3.docx"
SECTIONS = [
    "paper_a_abstract_v3.md",
    # paper_a_impact_statement_v3.md removed: not a standard IEEE Access
    # Regular Paper section. Content folded into cover letter / abstract.
    "paper_a_introduction_v3.md",
    "paper_a_related_work_v3.md",
    "paper_a_methodology_v3.md",
    "paper_a_results_v3.md",
    "paper_a_discussion_v3.md",
    "paper_a_conclusion_v3.md",
    # Appendix A: BD/McCrary bin-width sensitivity (see v3.7 notes).
    "paper_a_appendix_v3.md",
    # Declarations (COI / data availability / funding) before References,
    # per IEEE Access convention.
    "paper_a_declarations_v3.md",
    "paper_a_references_v3.md",
]
# Figure insertion hooks (trigger phrase -> (file, caption, width inches)).
# New figures for v3: dip test, BD/McCrary overlays, accountant GMM 2D + marginals.
FIGURES = {
    "Fig. 1 illustrates": (
        FIG_DIR / "fig1_pipeline.png",
        "Fig. 1. Pipeline architecture for automated non-hand-signed signature detection.",
        6.5,
    ),
    "Fig. 2 presents the cosine similarity distributions for intra-class": (
        FIG_DIR / "fig2_intra_inter_kde.png",
        "Fig. 2. Cosine similarity distributions: intra-class vs. inter-class with KDE crossover at 0.837.",
        3.5,
    ),
    "Fig. 3 presents the per-signature cosine and dHash distributions of Firm A": (
        FIG_DIR / "fig3_firm_a_calibration.png",
        "Fig. 3. Firm A per-signature cosine and dHash distributions against the overall CPA population.",
        3.5,
    ),
    "Fig. 4 summarises the per-firm yearly per-signature": (
        EXTRA_FIG_DIR / "figures" / "fig_yearly_big4_comparison.png",
        "Fig. 4. Per-firm yearly per-signature best-match cosine, 2013-2023. "
        "(a) Mean per-signature best-match cosine by firm bucket and fiscal year "
        "(threshold-free). (b) Share of per-signature best-match cosine ≥ 0.95 "
        "(operational cut of Section III-K). Five lines: Firm A, B, C, D, "
        "Non-Big-4. Firm A is above the other Big-4 firms in every year; "
        "Non-Big-4 is below all four Big-4 firms in every year.",
        6.5,
    ),
    "conducted an ablation study comparing three": (
        FIG_DIR / "fig4_ablation.png",
        "Fig. 5. Ablation study comparing three feature extraction backbones.",
        6.5,
    ),
}
def strip_comments(text):
    """Remove HTML comments, but UNWRAP comments whose first non-blank line
    starts with `TABLE ` (or `TABLE\t`).

    The v3 markdown sources wrap every numerical table in an HTML comment of
    the form

        <!-- TABLE V: Hartigan Dip Test Results
        | Distribution | N | ... |
        |--------------|---|-----|
        | ... | … | ... |
        -->

    The caption (`TABLE V: Hartigan Dip Test Results`) is on the same line as
    the opening `<!--`, the markdown table body is on the lines following,
    and `-->` closes the block. The previous implementation wholesale-deleted
    these comments, which silently dropped every table from the rendered
    DOCX. We now (i) detect comments whose first non-empty line starts with
    `TABLE `, (ii) emit a synthetic caption marker line `__TABLE_CAPTION__:
    <caption>` so process_section can render the caption as a centered
    bold paragraph above the table, and (iii) keep the table body so the
    existing markdown-table detector picks it up. Non-TABLE comments
    (figure placeholders, editorial notes) are stripped as before.
    """
    def _replace(match):
        body = match.group(1)
        # Find first non-blank line.
        for line in body.splitlines():
            stripped = line.strip()
            if stripped:
                first = stripped
                break
        else:
            return ""
        if not first.startswith("TABLE ") and not first.startswith("TABLE\t"):
            return ""
        # Split caption (first non-blank line) from the rest.
        lines = body.splitlines()
        # Find index of the first non-blank line and use everything after.
        for idx, line in enumerate(lines):
            if line.strip():
                caption = line.strip()
                rest = "\n".join(lines[idx + 1:])
                break
        else:
            return ""
        # Emit caption marker + body. Surround with blank lines so the
        # paragraph/table detector treats the marker as its own paragraph.
        return f"\n\n__TABLE_CAPTION__:{caption}\n{rest}\n"

    # Non-greedy match across lines.
    return re.sub(r"<!--(.*?)-->", _replace, text, flags=re.DOTALL)

# ---------------------------------------------------------------------------
# LaTeX → plain text + Unicode conversion
# ---------------------------------------------------------------------------
# The v3 markdown sources contain inline LaTeX ($...$) and a small number of
# display-math blocks ($$...$$). Pandoc would render these natively; the
# python-docx pipeline used here does not, so without preprocessing every
# `\leq`, `\text{dHash}_\text{indep}`, `\Delta\text{BIC}`, `60{,}448`, etc.
# leaks into the DOCX as raw LaTeX. The helpers below convert the common
# inline cases to Unicode and split subscripts/superscripts into proper Word
# runs. Display-math (rare; 3 equations in this paper) gets a best-effort
# linearisation and is acceptable for a partner-handoff DOCX; final IEEE
# typesetting is handled by the publisher's LaTeX/MathType pipeline.
LATEX_TOKEN_REPLACEMENTS = [
    # Greek letters (lower)
    (r"\\alpha(?![A-Za-z])", "α"), (r"\\beta(?![A-Za-z])", "β"), (r"\\gamma(?![A-Za-z])", "γ"),
    (r"\\delta(?![A-Za-z])", "δ"), (r"\\epsilon(?![A-Za-z])", "ε"), (r"\\zeta(?![A-Za-z])", "ζ"),
    (r"\\eta(?![A-Za-z])", "η"), (r"\\theta(?![A-Za-z])", "θ"), (r"\\iota(?![A-Za-z])", "ι"),
    (r"\\kappa(?![A-Za-z])", "κ"), (r"\\lambda(?![A-Za-z])", "λ"), (r"\\mu(?![A-Za-z])", "μ"),
    (r"\\nu(?![A-Za-z])", "ν"), (r"\\xi(?![A-Za-z])", "ξ"), (r"\\pi(?![A-Za-z])", "π"),
    (r"\\rho(?![A-Za-z])", "ρ"), (r"\\sigma(?![A-Za-z])", "σ"), (r"\\tau(?![A-Za-z])", "τ"),
    (r"\\phi(?![A-Za-z])", "φ"), (r"\\chi(?![A-Za-z])", "χ"), (r"\\psi(?![A-Za-z])", "ψ"),
    (r"\\omega(?![A-Za-z])", "ω"),
    # Greek letters (upper, only those distinguishable from Latin)
    (r"\\Gamma(?![A-Za-z])", "Γ"), (r"\\Delta(?![A-Za-z])", "Δ"), (r"\\Theta(?![A-Za-z])", "Θ"),
    (r"\\Lambda(?![A-Za-z])", "Λ"), (r"\\Xi(?![A-Za-z])", "Ξ"), (r"\\Pi(?![A-Za-z])", "Π"),
    (r"\\Sigma(?![A-Za-z])", "Σ"), (r"\\Phi(?![A-Za-z])", "Φ"), (r"\\Psi(?![A-Za-z])", "Ψ"),
    (r"\\Omega(?![A-Za-z])", "Ω"),
    # Relations / arrows
    (r"\\leq(?![A-Za-z])", "≤"), (r"\\geq(?![A-Za-z])", "≥"),
    (r"\\neq(?![A-Za-z])", "≠"), (r"\\approx(?![A-Za-z])", "≈"),
    (r"\\equiv(?![A-Za-z])", "≡"), (r"\\sim(?![A-Za-z])", "~"),
    (r"\\to(?![A-Za-z])", "→"), (r"\\rightarrow(?![A-Za-z])", "→"),
    (r"\\leftarrow(?![A-Za-z])", "←"), (r"\\Rightarrow(?![A-Za-z])", "⇒"),
    (r"\\Leftarrow(?![A-Za-z])", "⇐"),
    # Binary operators
    (r"\\times(?![A-Za-z])", "×"), (r"\\cdot(?![A-Za-z])", "·"),
    (r"\\pm(?![A-Za-z])", "±"), (r"\\mp(?![A-Za-z])", "∓"),
    (r"\\div(?![A-Za-z])", "÷"),
    # Misc
    (r"\\infty(?![A-Za-z])", "∞"), (r"\\partial(?![A-Za-z])", "∂"),
    (r"\\sum(?![A-Za-z])", "∑"), (r"\\prod(?![A-Za-z])", "∏"),
    (r"\\int(?![A-Za-z])", "∫"),
    (r"\\ldots(?![A-Za-z])", "…"), (r"\\dots(?![A-Za-z])", "…"),
    # Spacing commands (drop or replace with single space)
    (r"\\,", " "), (r"\\;", " "), (r"\\:", " "),
    (r"\\!", ""), (r"\\ ", " "),
    (r"\\quad(?![A-Za-z])", " "), (r"\\qquad(?![A-Za-z])", " "),
    # Escaped punctuation
    (r"\\%", "%"), (r"\\#", "#"), (r"\\&", "&"),
    (r"\\\$", "$"), (r"\\_", "_"),
]
def _unwrap_command(text, cmd):
    """Repeatedly replace `\\cmd{X}` → `X` until stable."""
    pat = re.compile(r"\\" + cmd + r"\{([^{}]*)\}")
    prev = None
    while prev != text:
        prev = text
        text = pat.sub(r"\1", text)
    return text

MATH_START = "\ue000"  # U+E000: Private Use Area, XML-safe
MATH_END = "\ue001"    # U+E001
def latex_to_unicode(text):
    """Convert a LaTeX-laced markdown paragraph into plain text.

    Math context is preserved with private-use sentinel characters
    (MATH_START / MATH_END) so the downstream run-splitter only treats
    `_X` / `^X` as subscript / superscript inside math regions; in body
    text underscores in identifiers like `signature_analysis` survive.
    """
    if "$" not in text and "\\" not in text:
        return text
    # 1. Strip display-math delimiters first (keep the inner content for
    #    best-effort linearisation), wrapping math regions with sentinels.
    #    Then strip inline math delimiters with the same sentinel wrapping.
    text = re.sub(r"\$\$([\s\S]+?)\$\$",
                  lambda m: MATH_START + m.group(1) + MATH_END, text)
    text = re.sub(r"\$([^$]+?)\$",
                  lambda m: MATH_START + m.group(1) + MATH_END, text)
    # 2. Replace token-level commands with Unicode glyphs *before* unwrapping
    #    `\text{...}` and friends, so that `\Delta\text{BIC}` becomes
    #    `Δ\text{BIC}` (then `ΔBIC`) rather than `\DeltaBIC` which would be
    #    stripped wholesale by the cleanup pass.
    for pat, repl in LATEX_TOKEN_REPLACEMENTS:
        text = re.sub(pat, repl, text)
    # 3. Unwrap formatting / text commands (innermost first via _unwrap loop).
    for cmd in ("text", "mathbf", "mathit", "mathrm", "mathsf", "mathtt",
                "operatorname", "emph", "textbf", "textit"):
        text = _unwrap_command(text, cmd)
    # 4. \frac{a}{b} → (a)/(b); \sqrt{x} → √(x). Apply repeatedly to handle
    #    one level of nesting; deeper nesting is rare in this paper.
    for _ in range(3):
        text = re.sub(
            r"\\t?frac\{([^{}]+)\}\{([^{}]+)\}",
            r"(\1)/(\2)",
            text,
        )
        text = re.sub(r"\\sqrt\{([^{}]+)\}", r"√(\1)", text)
    # 5. TeX braces used purely for spacing/grouping: K{=}3 → K=3,
    #    60{,}448 → 60,448, 10{,}175 → 10,175.
    text = re.sub(r"\{([=<>+\-,])\}", r"\1", text)
    # 6. Strip any remaining `\cmd{...}` (best effort) and `\cmd ` tokens.
    text = re.sub(r"\\[a-zA-Z]+\{([^{}]*)\}", r"\1", text)
    text = re.sub(r"\\[a-zA-Z]+(?![A-Za-z])", "", text)
    # 7. Collapse runs of whitespace introduced by command stripping.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text

_SUBSUP_PATTERN = re.compile(
    r"_\{([^{}]*)\}"        # _{...}
    r"|\^\{([^{}]*)\}"      # ^{...}
    r"|_([A-Za-z0-9+\-])"   # _X (single token)
    r"|\^([A-Za-z0-9+\-])"  # ^X (single token)
)

def _emit_plain(paragraph, text, font_name, font_size, bold, italic):
    if not text:
        return
    run = paragraph.add_run(text)
    run.font.name = font_name
    run.font.size = font_size
    run.bold = bold
    run.italic = italic


def _emit_math(paragraph, text, font_name, font_size, bold, italic):
    """Emit `text` from a math region: split on `_X` / `_{X}` / `^X` / `^{X}`
    and render those as Word subscripts / superscripts."""
    if "_" not in text and "^" not in text:
        _emit_plain(paragraph, text, font_name, font_size, bold, italic)
        return
    pos = 0
    for m in _SUBSUP_PATTERN.finditer(text):
        if m.start() > pos:
            _emit_plain(paragraph, text[pos:m.start()],
                        font_name, font_size, bold, italic)
        sub_text = m.group(1) or m.group(3)
        sup_text = m.group(2) or m.group(4)
        if sub_text is not None:
            run = paragraph.add_run(sub_text)
            run.font.subscript = True
        else:
            run = paragraph.add_run(sup_text)
            run.font.superscript = True
        run.font.name = font_name
        run.font.size = font_size
        run.bold = bold
        run.italic = italic
        pos = m.end()
    if pos < len(text):
        _emit_plain(paragraph, text[pos:],
                    font_name, font_size, bold, italic)

def add_text_with_subsup(paragraph, text, font_name="Times New Roman",
                         font_size=Pt(10), bold=False, italic=False):
    """Add `text` to `paragraph`. Subscript/superscript handling is scoped to
    math regions delimited by MATH_START / MATH_END sentinels (set up by
    `latex_to_unicode`). Outside math regions, underscores and carets are
    preserved literally so identifiers like `signature_analysis` and
    `paper_a_results_v3.md` survive intact.
    """
    if MATH_START not in text:
        # No math region: everything is plain body text.
        _emit_plain(paragraph, text, font_name, font_size, bold, italic)
        return
    pos = 0
    while pos < len(text):
        s = text.find(MATH_START, pos)
        if s == -1:
            _emit_plain(paragraph, text[pos:],
                        font_name, font_size, bold, italic)
            break
        if s > pos:
            _emit_plain(paragraph, text[pos:s],
                        font_name, font_size, bold, italic)
        e = text.find(MATH_END, s + 1)
        if e == -1:
            # Unterminated math region — emit rest as plain.
            _emit_plain(paragraph, text[s + 1:],
                        font_name, font_size, bold, italic)
            break
        math_body = text[s + 1:e]
        _emit_math(paragraph, math_body, font_name, font_size, bold, italic)
        pos = e + 1

# ---------------------------------------------------------------------------
# Display-equation rendering (matplotlib mathtext → PNG → embedded image)
# ---------------------------------------------------------------------------
# matplotlib mathtext is a subset of LaTeX. A few common TeX-only macros need
# to be substituted with mathtext-supported equivalents before parsing.
_MATHTEXT_SUBS = [
    (re.compile(r"\\tfrac\b"), r"\\frac"),   # text-frac → frac
    (re.compile(r"\\dfrac\b"), r"\\frac"),   # display-frac → frac
    (re.compile(r"\\operatorname\{([^{}]+)\}"),
     lambda m: r"\mathrm{" + m.group(1) + "}"),  # operatorname → mathrm
    (re.compile(r"\\,"), " "),  # thin space
    (re.compile(r"\\;"), " "),
    (re.compile(r"\\!"), ""),
]


def _sanitise_for_mathtext(latex: str) -> str:
    out = latex
    for pat, repl in _MATHTEXT_SUBS:
        out = pat.sub(repl, out)
    return out

def render_equation_png(latex: str, fontsize: int = 14) -> Path:
    """Render a LaTeX math expression to a tightly-cropped PNG using
    matplotlib mathtext, with content-addressed caching so a re-build only
    re-renders changed equations. Returns the cached PNG path."""
    sanitised = _sanitise_for_mathtext(latex.strip())
    digest = hashlib.sha1(
        (sanitised + f"|fs{fontsize}").encode("utf-8")).hexdigest()[:16]
    out_path = EQUATION_CACHE_DIR / f"eq_{digest}.png"
    if out_path.exists():
        return out_path
    fig = plt.figure(figsize=(8, 1.6))
    fig.text(0.5, 0.5, f"${sanitised}$",
             fontsize=fontsize, ha="center", va="center")
    fig.savefig(str(out_path), dpi=220, bbox_inches="tight",
                pad_inches=0.05)
    plt.close(fig)
    return out_path

def add_equation_block(doc, latex: str, equation_number: int,
                       width_inches: float = 4.5):
    """Insert a centered display equation (rendered as PNG) followed by
    a right-aligned equation number `(N)`. Width keeps the equation
    visually proportional within the IEEE Access body column."""
    img_path = render_equation_png(latex)
    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    p.paragraph_format.space_before = Pt(6)
    p.paragraph_format.space_after = Pt(6)
    run = p.add_run()
    run.add_picture(str(img_path), width=Inches(width_inches))
    # Equation number on the same paragraph, tab-aligned to the right.
    num_run = p.add_run(f"\t({equation_number})")
    num_run.font.name = "Times New Roman"
    num_run.font.size = Pt(10)

def add_md_table(doc, table_lines):
    rows_data = []
    for line in table_lines:
        cells = [c.strip() for c in line.strip("|").split("|")]
        if not re.match(r"^[-: ]+$", cells[0]):
            rows_data.append(cells)
    if len(rows_data) < 2:
        return
    ncols = len(rows_data[0])
    table = doc.add_table(rows=len(rows_data), cols=ncols)
    table.style = "Table Grid"
    for r_idx, row in enumerate(rows_data):
        for c_idx in range(min(len(row), ncols)):
            cell = table.rows[r_idx].cells[c_idx]
            raw = row[c_idx]
            # Strip markdown emphasis markers; convert LaTeX before rendering.
            raw = re.sub(r"\*\*\*(.+?)\*\*\*", r"\1", raw)
            raw = re.sub(r"\*\*(.+?)\*\*", r"\1", raw)
            raw = re.sub(r"\*(.+?)\*", r"\1", raw)
            raw = re.sub(r"`(.+?)`", r"\1", raw)
            cell_text = latex_to_unicode(raw)
            # Replace the default empty paragraph with one we control.
            cell.text = ""
            cp = cell.paragraphs[0]
            cp.alignment = WD_ALIGN_PARAGRAPH.CENTER
            add_text_with_subsup(
                cp, cell_text,
                font_name="Times New Roman",
                font_size=Pt(8),
                bold=(r_idx == 0),
            )
    doc.add_paragraph()

def _insert_figures(doc, para_text):
    for trigger, (fig_path, caption, width) in FIGURES.items():
        if trigger in para_text and Path(fig_path).exists():
            fp = doc.add_paragraph()
            fp.alignment = WD_ALIGN_PARAGRAPH.CENTER
            fr = fp.add_run()
            fr.add_picture(str(fig_path), width=Inches(width))
            cp = doc.add_paragraph()
            cp.alignment = WD_ALIGN_PARAGRAPH.CENTER
            cr = cp.add_run(caption)
            cr.font.size = Pt(9)
            cr.font.name = "Times New Roman"
            cr.italic = True

def process_section(doc, filepath, equation_counter=None):
    """Process one v3 markdown section. `equation_counter` is a single-element
    list (used as a mutable counter shared across sections) tracking the
    running display-equation number."""
    if equation_counter is None:
        equation_counter = [0]
    text = filepath.read_text(encoding="utf-8")
    text = strip_comments(text)
    lines = text.split("\n")
    # Defensive blockquote handling: markdown blockquote lines (`> body`) are
    # not rendered as Word callout blocks here, but stripping the leading
    # `> ` keeps the body text from leaking the literal `>` and the empty
    # `>` separator lines into the DOCX.
    cleaned = []
    for ln in lines:
        s = ln.lstrip()
        if s == ">" or s.startswith("> "):
            cleaned.append(ln[ln.index(">") + 1:].lstrip() if "> " in ln else "")
        else:
            cleaned.append(ln)
    lines = cleaned
    i = 0
    while i < len(lines):
        line = lines[i]
        stripped = line.strip()
        if not stripped:
            i += 1
            continue
        if stripped.startswith("# "):
            h = doc.add_heading(
                latex_to_unicode(stripped[2:]).replace(MATH_START, "").replace(MATH_END, ""),
                level=1)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        if stripped.startswith("## "):
            h = doc.add_heading(
                latex_to_unicode(stripped[3:]).replace(MATH_START, "").replace(MATH_END, ""),
                level=2)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        if stripped.startswith("### "):
            h = doc.add_heading(
                latex_to_unicode(stripped[4:]).replace(MATH_START, "").replace(MATH_END, ""),
                level=3)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        if stripped.startswith("__TABLE_CAPTION__:"):
            caption_text = stripped[len("__TABLE_CAPTION__:"):].strip()
            caption_text = latex_to_unicode(caption_text)
            cp = doc.add_paragraph()
            cp.alignment = WD_ALIGN_PARAGRAPH.CENTER
            cp.paragraph_format.space_before = Pt(6)
            cp.paragraph_format.space_after = Pt(2)
            add_text_with_subsup(
                cp, caption_text,
                font_name="Times New Roman",
                font_size=Pt(9),
                bold=True,
            )
            i += 1
            continue
        if "|" in stripped and i + 1 < len(lines) and re.match(r"\s*\|[-|: ]+\|", lines[i + 1]):
            table_lines = []
            while i < len(lines) and "|" in lines[i]:
                table_lines.append(lines[i])
                i += 1
            add_md_table(doc, table_lines)
            continue
        # Display math: a line starting with `$$` is treated as a single-line
        # equation block and rendered as an embedded mathtext PNG with an
        # auto-incrementing equation number.
        if stripped.startswith("$$"):
            # Accumulate until a closing $$ is found (single line in our
            # corpus, but defensively support multi-line just in case).
            buf = [stripped]
            if not (stripped.count("$$") >= 2 and stripped.endswith("$$")):
                while i + 1 < len(lines):
                    i += 1
                    buf.append(lines[i])
                    if "$$" in lines[i]:
                        break
            joined = "\n".join(buf).strip()
            # Strip the leading and trailing $$ delimiters and any trailing
            # punctuation (e.g. the `,` that some equation lines end with).
            inner = joined
            if inner.startswith("$$"):
                inner = inner[2:]
            if inner.endswith("$$"):
                inner = inner[:-2]
            inner = inner.rstrip(", ")
            equation_counter[0] += 1
            try:
                add_equation_block(doc, inner, equation_counter[0])
            except Exception as exc:
                # Fallback: render as plain centered Times-Roman line so the
                # build doesn't fail on a single un-renderable equation.
                p = doc.add_paragraph()
                p.alignment = WD_ALIGN_PARAGRAPH.CENTER
                run = p.add_run(f"[equation render failed: {exc}] {inner}")
                run.font.name = "Times New Roman"
                run.font.size = Pt(10)
                run.italic = True
            i += 1
            continue
        if re.match(r"^\d+\.\s", stripped):
            # Manual numbering: keep the number from the markdown source and
            # apply a hanging-indent paragraph format. Avoids python-docx's
            # `style='List Number'` which depends on a properly-set-up
            # numbering definition that the default Document() lacks.
            m = re.match(r"^(\d+)\.\s+(.*)$", stripped)
            num, content = m.group(1), m.group(2)
            p = doc.add_paragraph()
            p.paragraph_format.left_indent = Inches(0.4)
            p.paragraph_format.first_line_indent = Inches(-0.25)
            p.paragraph_format.space_after = Pt(4)
            content = re.sub(r"\*\*\*(.+?)\*\*\*", r"\1", content)
            content = re.sub(r"\*\*(.+?)\*\*", r"\1", content)
            content = re.sub(r"\*(.+?)\*", r"\1", content)
            content = re.sub(r"`(.+?)`", r"\1", content)
            content = latex_to_unicode(content)
            add_text_with_subsup(p, f"{num}. {content}")
            i += 1
            continue
        if stripped.startswith("- "):
            # Manual bullets with hanging indent (same rationale as numbered).
            p = doc.add_paragraph()
            p.paragraph_format.left_indent = Inches(0.4)
            p.paragraph_format.first_line_indent = Inches(-0.25)
            p.paragraph_format.space_after = Pt(4)
            content = stripped[2:]
            content = re.sub(r"\*\*\*(.+?)\*\*\*", r"\1", content)
            content = re.sub(r"\*\*(.+?)\*\*", r"\1", content)
            content = re.sub(r"\*(.+?)\*", r"\1", content)
            content = re.sub(r"`(.+?)`", r"\1", content)
            content = latex_to_unicode(content)
            add_text_with_subsup(p, content)
            i += 1
            continue
        # Regular paragraph
        para_lines = [stripped]
        i += 1
        while i < len(lines):
            nxt = lines[i].strip()
            if (
                not nxt
                or nxt.startswith("#")
                or nxt.startswith("|")
                or nxt.startswith("- ")
                or re.match(r"^\d+\.\s", nxt)
            ):
                break
            para_lines.append(nxt)
            i += 1
        para_text = " ".join(para_lines)
        para_text = re.sub(r"\*\*\*(.+?)\*\*\*", r"\1", para_text)
        para_text = re.sub(r"\*\*(.+?)\*\*", r"\1", para_text)
        para_text = re.sub(r"\*(.+?)\*", r"\1", para_text)
        para_text = re.sub(r"`(.+?)`", r"\1", para_text)
        para_text = para_text.replace("---", "\u2014")
        para_text = latex_to_unicode(para_text)
        p = doc.add_paragraph()
        p.paragraph_format.space_after = Pt(6)
        add_text_with_subsup(p, para_text)
        _insert_figures(doc, para_text)

def main():
    doc = Document()
    style = doc.styles["Normal"]
    style.font.name = "Times New Roman"
    style.font.size = Pt(10)

    # Title page
    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    p.paragraph_format.space_after = Pt(12)
    run = p.add_run(
        "Automated Identification of Non-Hand-Signed Auditor Signatures\n"
        "in Large-Scale Financial Audit Reports:\n"
        "A Dual-Descriptor Framework with Replication-Dominated Calibration"
    )
    run.font.size = Pt(16)
    run.font.name = "Times New Roman"
    run.bold = True

    # IEEE Access uses single-anonymized review: author / affiliation
    # / corresponding-author block must appear on the title page in the
    # final submission. Fill these placeholders with real metadata
    # before submitting the generated DOCX.
    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    p.paragraph_format.space_after = Pt(6)
    run = p.add_run("[AUTHOR NAMES — fill in before submission]")
    run.font.size = Pt(11)

    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    p.paragraph_format.space_after = Pt(6)
    run = p.add_run("[Affiliations and corresponding-author email — fill in before submission]")
    run.font.size = Pt(10)
    run.italic = True

    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    p.paragraph_format.space_after = Pt(20)
    run = p.add_run("Target journal: IEEE Access (Regular Paper, single-anonymized review)")
    run.font.size = Pt(10)
    run.italic = True

    equation_counter = [0]
    for section_file in SECTIONS:
        filepath = PAPER_DIR / section_file
        if filepath.exists():
            process_section(doc, filepath, equation_counter=equation_counter)
        else:
            print(f"WARNING: missing section file: {filepath}")
    doc.save(str(OUTPUT))
    print(f"Saved: {OUTPUT}")
    _run_linter()

def _run_linter():
    """Run the leak linter on the freshly built DOCX. Non-fatal: prints a
    summary line. For full output run `python3 paper/lint_paper_v3.py`."""
    try:
        import lint_paper_v3  # local module
    except Exception as exc:  # pragma: no cover
        print(f"(lint skipped: {exc})")
        return
    findings = lint_paper_v3.lint_docx(OUTPUT)
    errors = sum(1 for f in findings if f.severity == "ERROR")
    warns = sum(1 for f in findings if f.severity == "WARN")
    infos = sum(1 for f in findings if f.severity == "INFO")
    if errors:
        print(f"\n[lint] {errors} ERROR finding(s) in DOCX — run "
              f"`python3 paper/lint_paper_v3.py --docx` for details.")
    elif warns or infos:
        print(f"[lint] DOCX clean of ERRORs ({warns} WARN, {infos} INFO).")
    else:
        print("[lint] DOCX clean.")


if __name__ == "__main__":
    main()