Paper A v3.20.0: partner Jimmy 2026-04-27 review + DOCX rendering overhaul
Substantive content (addresses partner Jimmy's 2026-04-27 review of v3.19.1):

Must-fix items (6/6):
- §III-F SSIM/pixel rejection rewritten from first principles (design-level argument from the luminance/contrast/structure local-window product, not the prior empirical 0.70 result)
- Table VI restructured by population × method; added missing Firm A logit-Gaussian-2 0.999 row; KDE marked undefined (unimodal), BD/McCrary marked bin-unstable (Appendix A)
- Tables IX / XI / §IV-F.3 dHash 5/8/15 inconsistency resolved: ≤8 demoted from "operational dual" to "calibration-fold-adjacent reference"; the actual classifier rule cos>0.95 AND dH≤15 = 92.46% added throughout
- New Fig. 4 (yearly per-firm best-match cosine, 5 lines, 2013-2023, Firm A on top); script 30_yearly_big4_comparison.py
- Tables XIV / XV extended with top-20% (94.8%) and top-30% (81.3%) brackets
- §III-K reframed P7.5 from "round-number lower-tail boundary" to operating point; new Table XII-B (cosine-FAR-capture tradeoff at 5 thresholds: 0.9407 / 0.945 / 0.95 / 0.977 / 0.985)

Nice-to-have items (3/3):
- Table XII expanded to 6-cut classifier sensitivity grid (0.940-0.985)
- Defensive parentheticals (84,386 vs 85,042; 30,226 vs 30,222) moved to table notes; cut "invite reviewer skepticism" and "non-load-bearing"

Codex 3-pass verification cleanup:
- Stale 0.973/0.977/0.979 references unified on the canonical 0.977 (Firm A Beta-2 forced-fit crossing from beta_mixture_results.json)
- dHash≤8 wording corrected to P95-adjacent (P95 = 9; ≤8 is the integer immediately below) instead of the misleading "rounded down"
- Table XII-B prose corrected: "non-Firm-A capture falls faster" now qualified per segment (true on the 0.95→0.977 segment but contracts on the 0.977→0.985 segment); arithmetic now from exact counts

Within-year analyses removed:
- Within-year ranking robustness check (Class A) was added in the nice-to-have pass but contradicts the v3.14 A2-removal stance; removed from §IV-G.2 and the Appendix B provenance row
- Within-CPA future-work disclosures (Class B) removed from Discussion limitation #5 and the Conclusion future-work paragraph; subsequent limitations renumbered Sixth → Fifth, Seventh → Sixth

DOCX rendering pipeline overhaul (paper/export_v3.py):

Critical fix: every v3 DOCX since v3.0 was shipping WITHOUT TABLES. strip_comments() was wholesale-deleting HTML comments, but every numerical table is wrapped in <!-- TABLE X: ... -->, so the table body was deleted alongside the wrapper. strip_comments() now unwraps TABLE comments (emitting a synthetic __TABLE_CAPTION__: marker plus the table body) while still stripping non-TABLE editorial comments. Result: 19 tables now render in the DOCX.

Other rendering fixes:
- LaTeX → Unicode conversion (50+ token replacements: Greek alphabet, ≤ ≥, × · ≈, → ↔ ⇒, etc.); \frac/\sqrt linearisation; TeX brace tricks ({=}, {,})
- Math-context-scoped sub/superscript via PUA sentinels: no more underscore-eating in identifiers like signature_analysis
- Display equations rendered via matplotlib mathtext to PNG (3 equations: cosine similarity, mixture crossing, BD/McCrary Z statistic), embedded as numbered equation blocks (1), (2), (3); content-addressed cache at paper/equations/ (gitignored, regenerable)
- Manual numbered/bulleted list rendering with hanging indent (replaces python-docx style="List Number", which silently drops the number prefix when no numbering definition is bound)
- Markdown blockquotes (> ...) defensively stripped
- Pandoc footnote ([^name]) markers no longer leak (inlined at source)
- Heading text cleaned of LaTeX residue and PUA sentinels
- File paths in body text (signature_analysis/X.py, reports/Y.json) trimmed to "(reproduction artifact in Appendix B)" pointers

New leak linter: paper/lint_paper_v3.py, a two-pass markdown-source + rendered-DOCX leak detector; auto-runs at the end of export_v3.py.

Script changes:
- 21_expanded_validation.py: added 0.9407, 0.977, 0.985 to the canonical FAR threshold list so Table XII-B is reproducible from persisted JSON
- 30_yearly_big4_comparison.py: NEW; generates Fig. 4 + per-firm yearly data (writes to reports/figures/ and reports/firm_yearly_comparison/)
- 31_within_year_ranking_robustness.py: NEW; supports the within-year robustness check (no longer cited in the paper but kept as a repo-internal due-diligence artifact)

Partner handoff DOCX shipped to ~/Downloads/Paper_A_IEEE_Access_Draft_v3.20.0_20260505.docx (536 KB: 19 tables + 4 figures + 3 equation images).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+472
-31
@@ -5,9 +5,16 @@ from docx import Document
from docx.shared import Inches, Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
from pathlib import Path
import hashlib
import re

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

PAPER_DIR = Path("/Volumes/NV2/pdf_recognize/paper")
EQUATION_CACHE_DIR = PAPER_DIR / "equations"
EQUATION_CACHE_DIR.mkdir(exist_ok=True)
FIG_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis/paper_figures")
EXTRA_FIG_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis/reports")
OUTPUT = PAPER_DIR / "Paper_A_IEEE_Access_Draft_v3.docx"
@@ -48,10 +55,10 @@ FIGURES = {
        "Fig. 3. Firm A per-signature cosine and dHash distributions against the overall CPA population.",
        3.5,
    ),
    "Fig. 4 summarises the per-firm yearly per-signature": (
        EXTRA_FIG_DIR / "figures" / "fig_yearly_big4_comparison.png",
        "Fig. 4. Per-firm yearly per-signature best-match cosine, 2013-2023. (a) Mean per-signature best-match cosine by firm bucket and fiscal year (threshold-free). (b) Share of per-signature best-match cosine ≥ 0.95 (operational cut of Section III-K). Five lines: Firm A, B, C, D, Non-Big-4. Firm A is above the other Big-4 firms in every year; Non-Big-4 is below all four Big-4 firms in every year.",
        6.5,
    ),
    "conducted an ablation study comparing three": (
        FIG_DIR / "fig4_ablation.png",
@@ -62,7 +69,321 @@ FIGURES = {


def strip_comments(text):
    """Remove HTML comments, but UNWRAP comments whose first non-blank line
    starts with `TABLE ` (or `TABLE\t`).

    The v3 markdown sources wrap every numerical table in an HTML comment of
    the form

        <!-- TABLE V: Hartigan Dip Test Results
        | Distribution | N | ... |
        |--------------|---|-----|
        | ...          | … | ... |
        -->

    The caption (`TABLE V: Hartigan Dip Test Results`) is on the same line as
    the opening `<!--`, the markdown table body is on the lines following,
    and `-->` closes the block. The previous implementation wholesale-deleted
    these comments, which silently dropped every table from the rendered
    DOCX. We now (i) detect comments whose first non-empty line starts with
    `TABLE `, (ii) emit a synthetic caption marker line
    `__TABLE_CAPTION__:<caption>` so process_section can render the caption
    as a centered bold paragraph above the table, and (iii) keep the table
    body so the existing markdown-table detector picks it up. Non-TABLE
    comments (figure placeholders, editorial notes) are stripped as before.
    """
    def _replace(match):
        body = match.group(1)
        # Find first non-blank line.
        for line in body.splitlines():
            stripped = line.strip()
            if stripped:
                first = stripped
                break
        else:
            return ""
        if not first.startswith("TABLE ") and not first.startswith("TABLE\t"):
            return ""
        # Split caption (first non-blank line) from the rest.
        lines = body.splitlines()
        # Find index of the first non-blank line and use everything after.
        for idx, line in enumerate(lines):
            if line.strip():
                caption = line.strip()
                rest = "\n".join(lines[idx + 1:])
                break
        else:
            return ""
        # Emit caption marker + body. Surround with blank lines so the
        # paragraph/table detector treats the marker as its own paragraph.
        return f"\n\n__TABLE_CAPTION__:{caption}\n{rest}\n"

    # Non-greedy match across lines.
    return re.sub(r"<!--(.*?)-->", _replace, text, flags=re.DOTALL)


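The unwrap behaviour can be exercised standalone. The sketch below is a condensed re-statement of the `_replace` logic above (same regex, compressed first-line handling), not an import of the actual script:

```python
import re

def strip_comments(text):
    # Unwrap <!-- TABLE ... --> comments; delete every other HTML comment.
    def _replace(match):
        body = match.group(1)
        first = next((l.strip() for l in body.splitlines() if l.strip()), "")
        if not first.startswith(("TABLE ", "TABLE\t")):
            return ""  # non-TABLE comment: strip as before
        lines = body.splitlines()
        idx = next(i for i, l in enumerate(lines) if l.strip())
        rest = "\n".join(lines[idx + 1:])
        return f"\n\n__TABLE_CAPTION__:{first}\n{rest}\n"
    return re.sub(r"<!--(.*?)-->", _replace, text, flags=re.DOTALL)

src = "intro\n<!-- editorial note -->\n<!-- TABLE V: Dip Test\n| a | b |\n|---|---|\n-->\n"
out = strip_comments(src)
```

The editorial comment vanishes; the TABLE comment is replaced by its caption marker plus the intact markdown table body, which the downstream table detector then picks up.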
# ---------------------------------------------------------------------------
# LaTeX → plain text + Unicode conversion
# ---------------------------------------------------------------------------
# The v3 markdown sources contain inline LaTeX ($...$) and a small number of
# display-math blocks ($$...$$). Pandoc would render these natively; the
# python-docx pipeline used here does not, so without preprocessing every
# `\leq`, `\text{dHash}_\text{indep}`, `\Delta\text{BIC}`, `60{,}448`, etc.
# leaks into the DOCX as raw LaTeX. The helpers below convert the common
# inline cases to Unicode and split subscripts/superscripts into proper Word
# runs. Display-math (rare; 3 equations in this paper) gets a best-effort
# linearisation and is acceptable for a partner-handoff DOCX; final IEEE
# typesetting is handled by the publisher's LaTeX/MathType pipeline.

LATEX_TOKEN_REPLACEMENTS = [
    # Greek letters (lower)
    (r"\\alpha(?![A-Za-z])", "α"), (r"\\beta(?![A-Za-z])", "β"), (r"\\gamma(?![A-Za-z])", "γ"),
    (r"\\delta(?![A-Za-z])", "δ"), (r"\\epsilon(?![A-Za-z])", "ε"), (r"\\zeta(?![A-Za-z])", "ζ"),
    (r"\\eta(?![A-Za-z])", "η"), (r"\\theta(?![A-Za-z])", "θ"), (r"\\iota(?![A-Za-z])", "ι"),
    (r"\\kappa(?![A-Za-z])", "κ"), (r"\\lambda(?![A-Za-z])", "λ"), (r"\\mu(?![A-Za-z])", "μ"),
    (r"\\nu(?![A-Za-z])", "ν"), (r"\\xi(?![A-Za-z])", "ξ"), (r"\\pi(?![A-Za-z])", "π"),
    (r"\\rho(?![A-Za-z])", "ρ"), (r"\\sigma(?![A-Za-z])", "σ"), (r"\\tau(?![A-Za-z])", "τ"),
    (r"\\phi(?![A-Za-z])", "φ"), (r"\\chi(?![A-Za-z])", "χ"), (r"\\psi(?![A-Za-z])", "ψ"),
    (r"\\omega(?![A-Za-z])", "ω"),
    # Greek letters (upper, only those distinguishable from Latin)
    (r"\\Gamma(?![A-Za-z])", "Γ"), (r"\\Delta(?![A-Za-z])", "Δ"), (r"\\Theta(?![A-Za-z])", "Θ"),
    (r"\\Lambda(?![A-Za-z])", "Λ"), (r"\\Xi(?![A-Za-z])", "Ξ"), (r"\\Pi(?![A-Za-z])", "Π"),
    (r"\\Sigma(?![A-Za-z])", "Σ"), (r"\\Phi(?![A-Za-z])", "Φ"), (r"\\Psi(?![A-Za-z])", "Ψ"),
    (r"\\Omega(?![A-Za-z])", "Ω"),
    # Relations / arrows
    (r"\\leq(?![A-Za-z])", "≤"), (r"\\geq(?![A-Za-z])", "≥"),
    (r"\\neq(?![A-Za-z])", "≠"), (r"\\approx(?![A-Za-z])", "≈"),
    (r"\\equiv(?![A-Za-z])", "≡"), (r"\\sim(?![A-Za-z])", "~"),
    (r"\\to(?![A-Za-z])", "→"), (r"\\rightarrow(?![A-Za-z])", "→"),
    (r"\\leftarrow(?![A-Za-z])", "←"), (r"\\Rightarrow(?![A-Za-z])", "⇒"),
    (r"\\Leftarrow(?![A-Za-z])", "⇐"),
    # Binary operators
    (r"\\times(?![A-Za-z])", "×"), (r"\\cdot(?![A-Za-z])", "·"),
    (r"\\pm(?![A-Za-z])", "±"), (r"\\mp(?![A-Za-z])", "∓"),
    (r"\\div(?![A-Za-z])", "÷"),
    # Misc
    (r"\\infty(?![A-Za-z])", "∞"), (r"\\partial(?![A-Za-z])", "∂"),
    (r"\\sum(?![A-Za-z])", "∑"), (r"\\prod(?![A-Za-z])", "∏"),
    (r"\\int(?![A-Za-z])", "∫"),
    (r"\\ldots(?![A-Za-z])", "…"), (r"\\dots(?![A-Za-z])", "…"),
    # Spacing commands (drop or replace with single space)
    (r"\\,", " "), (r"\\;", " "), (r"\\:", " "),
    (r"\\!", ""), (r"\\ ", " "),
    (r"\\quad(?![A-Za-z])", " "), (r"\\qquad(?![A-Za-z])", " "),
    # Escaped punctuation
    (r"\\%", "%"), (r"\\#", "#"), (r"\\&", "&"),
    (r"\\\$", "$"), (r"\\_", "_"),
]

def _unwrap_command(text, cmd):
    """Repeatedly replace `\\cmd{X}` → `X` until stable."""
    pat = re.compile(r"\\" + cmd + r"\{([^{}]*)\}")
    prev = None
    while prev != text:
        prev = text
        text = pat.sub(r"\1", text)
    return text


MATH_START = "\ue000"  # Private Use Area: XML-safe sentinel
MATH_END = "\ue001"  # Private Use Area: XML-safe sentinel


def latex_to_unicode(text):
    """Convert a LaTeX-laced markdown paragraph into plain text.

    Math context is preserved with private-use sentinel characters
    (MATH_START / MATH_END) so the downstream run-splitter only treats
    `_X` / `^X` as subscript / superscript inside math regions; in body
    text underscores in identifiers like `signature_analysis` survive.
    """
    if "$" not in text and "\\" not in text:
        return text

    # 1. Strip display-math delimiters first (keep the inner content for
    #    best-effort linearisation), wrapping math regions with sentinels.
    #    Then strip inline math delimiters with the same sentinel wrapping.
    text = re.sub(r"\$\$([\s\S]+?)\$\$",
                  lambda m: MATH_START + m.group(1) + MATH_END, text)
    text = re.sub(r"\$([^$]+?)\$",
                  lambda m: MATH_START + m.group(1) + MATH_END, text)

    # 2. Replace token-level commands with Unicode glyphs *before* unwrapping
    #    `\text{...}` and friends, so that `\Delta\text{BIC}` becomes
    #    `Δ\text{BIC}` (then `ΔBIC`) rather than `\DeltaBIC` which would be
    #    stripped wholesale by the cleanup pass.
    for pat, repl in LATEX_TOKEN_REPLACEMENTS:
        text = re.sub(pat, repl, text)

    # 3. Unwrap formatting / text commands (innermost first via _unwrap loop).
    for cmd in ("text", "mathbf", "mathit", "mathrm", "mathsf", "mathtt",
                "operatorname", "emph", "textbf", "textit"):
        text = _unwrap_command(text, cmd)

    # 4. \frac{a}{b} → (a)/(b); \sqrt{x} → √(x). Apply repeatedly to handle
    #    one level of nesting; deeper nesting is rare in this paper.
    for _ in range(3):
        text = re.sub(
            r"\\t?frac\{([^{}]+)\}\{([^{}]+)\}",
            r"(\1)/(\2)",
            text,
        )
        text = re.sub(r"\\sqrt\{([^{}]+)\}", r"√(\1)", text)

    # 5. TeX braces used purely for spacing/grouping: K{=}3 → K=3,
    #    60{,}448 → 60,448, 10{,}175 → 10,175.
    text = re.sub(r"\{([=<>+\-,])\}", r"\1", text)

    # 6. Strip any remaining `\cmd{...}` (best effort) and `\cmd ` tokens.
    text = re.sub(r"\\[a-zA-Z]+\{([^{}]*)\}", r"\1", text)
    text = re.sub(r"\\[a-zA-Z]+(?![A-Za-z])", "", text)

    # 7. Collapse runs of whitespace introduced by command stripping.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text


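The ordering in steps 2-5 (glyph substitution before `\text{...}` unwrapping, brace tricks last) can be spot-checked standalone. `TOKENS` and `convert` below are illustrative stand-ins for a trimmed slice of `LATEX_TOKEN_REPLACEMENTS` and the relevant passes, not the module's actual names:

```python
import re

# Trimmed copy of the replacement table: enough to show why glyph
# substitution must run before \text{...} unwrapping.
TOKENS = [(r"\\Delta(?![A-Za-z])", "Δ"),
          (r"\\leq(?![A-Za-z])", "≤"),
          (r"\\times(?![A-Za-z])", "×")]

def convert(text):
    for pat, repl in TOKENS:
        text = re.sub(pat, repl, text)              # \Delta\text{BIC} -> Δ\text{BIC}
    text = re.sub(r"\\text\{([^{}]*)\}", r"\1", text)   # Δ\text{BIC} -> ΔBIC
    text = re.sub(r"\{([=<>+\-,])\}", r"\1", text)      # 60{,}448 -> 60,448
    return text

print(convert(r"\Delta\text{BIC} \leq 60{,}448 \times 2"))
```

Running the `\text` unwrap first would yield `\DeltaBIC`, which the negative-lookahead token patterns no longer match; the cleanup pass would then strip the whole token.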
_SUBSUP_PATTERN = re.compile(
    r"_\{([^{}]*)\}"        # _{...}
    r"|\^\{([^{}]*)\}"      # ^{...}
    r"|_([A-Za-z0-9+\-])"   # _X (single token)
    r"|\^([A-Za-z0-9+\-])"  # ^X (single token)
)


def _emit_plain(paragraph, text, font_name, font_size, bold, italic):
    if not text:
        return
    run = paragraph.add_run(text)
    run.font.name = font_name
    run.font.size = font_size
    run.bold = bold
    run.italic = italic


def _emit_math(paragraph, text, font_name, font_size, bold, italic):
    """Emit `text` from a math region: split on `_X` / `_{X}` / `^X` / `^{X}`
    and render those as Word subscripts / superscripts."""
    if "_" not in text and "^" not in text:
        _emit_plain(paragraph, text, font_name, font_size, bold, italic)
        return
    pos = 0
    for m in _SUBSUP_PATTERN.finditer(text):
        if m.start() > pos:
            _emit_plain(paragraph, text[pos:m.start()],
                        font_name, font_size, bold, italic)
        sub_text = m.group(1) or m.group(3)
        sup_text = m.group(2) or m.group(4)
        if sub_text is not None:
            run = paragraph.add_run(sub_text)
            run.font.subscript = True
        else:
            run = paragraph.add_run(sup_text)
            run.font.superscript = True
        run.font.name = font_name
        run.font.size = font_size
        run.bold = bold
        run.italic = italic
        pos = m.end()
    if pos < len(text):
        _emit_plain(paragraph, text[pos:],
                    font_name, font_size, bold, italic)


def add_text_with_subsup(paragraph, text, font_name="Times New Roman",
                         font_size=Pt(10), bold=False, italic=False):
    """Add `text` to `paragraph`. Subscript/superscript handling is scoped to
    math regions delimited by MATH_START / MATH_END sentinels (set up by
    `latex_to_unicode`). Outside math regions, underscores and carets are
    preserved literally so identifiers like `signature_analysis` and
    `paper_a_results_v3.md` survive intact.
    """
    if MATH_START not in text:
        _emit_plain(paragraph, text, font_name, font_size, bold, italic)
        return

    pos = 0
    while pos < len(text):
        s = text.find(MATH_START, pos)
        if s == -1:
            _emit_plain(paragraph, text[pos:],
                        font_name, font_size, bold, italic)
            break
        if s > pos:
            _emit_plain(paragraph, text[pos:s],
                        font_name, font_size, bold, italic)
        e = text.find(MATH_END, s + 1)
        if e == -1:
            # Unterminated math region — emit rest as plain.
            _emit_plain(paragraph, text[s + 1:],
                        font_name, font_size, bold, italic)
            break
        math_body = text[s + 1:e]
        _emit_math(paragraph, math_body, font_name, font_size, bold, italic)
        pos = e + 1


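The sentinel contract can be checked without python-docx. `split_regions` below is a hypothetical helper, not part of the module: it walks a sentinel-delimited string with the same `find()`-based scan as `add_text_with_subsup`, but collects `(is_math, segment)` pairs instead of emitting Word runs (assumed single-character PUA sentinels, as above):

```python
MATH_START = "\ue000"  # assumed PUA sentinels, mirroring the module
MATH_END = "\ue001"

def split_regions(text):
    # Yield (is_math, segment) pairs by scanning for sentinel pairs.
    out, pos = [], 0
    while pos < len(text):
        s = text.find(MATH_START, pos)
        if s == -1:
            out.append((False, text[pos:]))
            break
        if s > pos:
            out.append((False, text[pos:s]))
        e = text.find(MATH_END, s + 1)
        if e == -1:
            out.append((False, text[s + 1:]))  # unterminated: treat as plain
            break
        out.append((True, text[s + 1:e]))
        pos = e + 1
    return out

txt = f"see signature_analysis and {MATH_START}x_i^2{MATH_END} here"
```

Only the middle segment is flagged as math, so `x_i^2` gets sub/superscript runs while the underscore in `signature_analysis` is left alone.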
# ---------------------------------------------------------------------------
# Display-equation rendering (matplotlib mathtext → PNG → embedded image)
# ---------------------------------------------------------------------------

# matplotlib mathtext is a subset of LaTeX. A few common TeX-only macros need
# to be substituted with mathtext-supported equivalents before parsing.
_MATHTEXT_SUBS = [
    (re.compile(r"\\tfrac\b"), r"\\frac"),  # text-frac → frac
    (re.compile(r"\\dfrac\b"), r"\\frac"),  # display-frac → frac
    (re.compile(r"\\operatorname\{([^{}]+)\}"),
     lambda m: r"\mathrm{" + m.group(1) + "}"),  # operatorname → mathrm
    (re.compile(r"\\,"), " "),  # thin space
    (re.compile(r"\\;"), " "),
    (re.compile(r"\\!"), ""),
]


def _sanitise_for_mathtext(latex: str) -> str:
    out = latex
    for pat, repl in _MATHTEXT_SUBS:
        out = pat.sub(repl, out)
    return out


def render_equation_png(latex: str, fontsize: int = 14) -> Path:
    """Render a LaTeX math expression to a tightly-cropped PNG using
    matplotlib mathtext, with content-addressed caching so a re-build only
    re-renders changed equations. Returns the cached PNG path."""
    sanitised = _sanitise_for_mathtext(latex.strip())
    digest = hashlib.sha1(
        (sanitised + f"|fs{fontsize}").encode("utf-8")).hexdigest()[:16]
    out_path = EQUATION_CACHE_DIR / f"eq_{digest}.png"
    if out_path.exists():
        return out_path
    fig = plt.figure(figsize=(8, 1.6))
    fig.text(0.5, 0.5, f"${sanitised}$",
             fontsize=fontsize, ha="center", va="center")
    fig.savefig(str(out_path), dpi=220, bbox_inches="tight",
                pad_inches=0.05)
    plt.close(fig)
    return out_path


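The cache is content-addressed: the file name is a SHA-1 digest of the sanitised source plus the font size, so an unchanged equation always resolves to the same cached PNG and a changed one gets a fresh name. `cache_name` below is an illustrative extraction of just the digest recipe (no matplotlib needed):

```python
import hashlib

def cache_name(sanitised: str, fontsize: int = 14) -> str:
    # Same recipe as render_equation_png: sha1(source + "|fs<N>")[:16].
    digest = hashlib.sha1(
        (sanitised + f"|fs{fontsize}").encode("utf-8")).hexdigest()[:16]
    return f"eq_{digest}.png"

a = cache_name(r"\frac{a}{b}")
b = cache_name(r"\frac{a}{b}")
c = cache_name(r"\frac{a}{b}", fontsize=16)
```

Folding the font size into the key means a size change also invalidates the cache entry, which is why the directory is safe to gitignore and regenerate.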
def add_equation_block(doc, latex: str, equation_number: int,
                       width_inches: float = 4.5):
    """Insert a centered display equation (rendered as PNG) followed by
    a right-aligned equation number `(N)`. Width keeps the equation
    visually proportional within the IEEE Access body column."""
    img_path = render_equation_png(latex)
    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    p.paragraph_format.space_before = Pt(6)
    p.paragraph_format.space_after = Pt(6)
    run = p.add_run()
    run.add_picture(str(img_path), width=Inches(width_inches))
    # Equation number on the same paragraph, tab-aligned to the right.
    num_run = p.add_run(f"\t({equation_number})")
    num_run.font.name = "Times New Roman"
    num_run.font.size = Pt(10)


def add_md_table(doc, table_lines):
@@ -79,14 +400,23 @@ def add_md_table(doc, table_lines):
    for r_idx, row in enumerate(rows_data):
        for c_idx in range(min(len(row), ncols)):
            cell = table.rows[r_idx].cells[c_idx]
            raw = row[c_idx]
            # Strip markdown emphasis markers; convert LaTeX before rendering.
            raw = re.sub(r"\*\*\*(.+?)\*\*\*", r"\1", raw)
            raw = re.sub(r"\*\*(.+?)\*\*", r"\1", raw)
            raw = re.sub(r"\*(.+?)\*", r"\1", raw)
            raw = re.sub(r"`(.+?)`", r"\1", raw)
            cell_text = latex_to_unicode(raw)
            # Replace the default empty paragraph with one we control.
            cell.text = ""
            cp = cell.paragraphs[0]
            cp.alignment = WD_ALIGN_PARAGRAPH.CENTER
            add_text_with_subsup(
                cp, cell_text,
                font_name="Times New Roman",
                font_size=Pt(8),
                bold=(r_idx == 0),
            )
    doc.add_paragraph()


@@ -105,10 +435,27 @@ def _insert_figures(doc, para_text):
    cr.italic = True


def process_section(doc, filepath, equation_counter=None):
    """Process one v3 markdown section. `equation_counter` is a single-element
    list (used as a mutable counter shared across sections) tracking the
    running display-equation number."""
    if equation_counter is None:
        equation_counter = [0]
    text = filepath.read_text(encoding="utf-8")
    text = strip_comments(text)
    lines = text.split("\n")
    # Defensive blockquote handling: markdown blockquote lines (`> body`) are
    # not rendered as Word callout blocks here, but stripping the leading
    # `> ` keeps the body text from leaking the literal `>` and the empty
    # `>` separator lines into the DOCX.
    cleaned = []
    for ln in lines:
        s = ln.lstrip()
        if s == ">" or s.startswith("> "):
            cleaned.append(s[1:].lstrip())
        else:
            cleaned.append(ln)
    lines = cleaned
    i = 0
    while i < len(lines):
        line = lines[i]
@@ -117,23 +464,44 @@ def process_section(doc, filepath):
            i += 1
            continue
        if stripped.startswith("# "):
            h = doc.add_heading(
                latex_to_unicode(stripped[2:]).replace(MATH_START, "").replace(MATH_END, ""),
                level=1)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        if stripped.startswith("## "):
            h = doc.add_heading(
                latex_to_unicode(stripped[3:]).replace(MATH_START, "").replace(MATH_END, ""),
                level=2)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        if stripped.startswith("### "):
            h = doc.add_heading(
                latex_to_unicode(stripped[4:]).replace(MATH_START, "").replace(MATH_END, ""),
                level=3)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        if stripped.startswith("__TABLE_CAPTION__:"):
            caption_text = stripped[len("__TABLE_CAPTION__:"):].strip()
            caption_text = latex_to_unicode(caption_text)
            cp = doc.add_paragraph()
            cp.alignment = WD_ALIGN_PARAGRAPH.CENTER
            cp.paragraph_format.space_before = Pt(6)
            cp.paragraph_format.space_after = Pt(2)
            add_text_with_subsup(
                cp, caption_text,
                font_name="Times New Roman",
                font_size=Pt(9),
                bold=True,
            )
            i += 1
            continue
if "|" in stripped and i + 1 < len(lines) and re.match(r"\s*\|[-|: ]+\|", lines[i + 1]):
|
||||
table_lines = []
|
||||
while i < len(lines) and "|" in lines[i]:
|
||||
@@ -141,22 +509,74 @@ def process_section(doc, filepath):
|
||||
i += 1
|
||||
add_md_table(doc, table_lines)
|
||||
continue
|
||||
# Display math: a line starting with `$$` is treated as a single-line
|
||||
# equation block and rendered as an embedded mathtext PNG with an
|
||||
# auto-incrementing equation number.
|
||||
if stripped.startswith("$$"):
|
||||
# Accumulate until a closing $$ is found (single line in our
|
||||
# corpus, but defensively support multi-line just in case).
|
||||
buf = [stripped]
|
||||
if not (stripped.count("$$") >= 2 and stripped.endswith("$$")):
|
||||
while i + 1 < len(lines):
|
||||
i += 1
|
||||
buf.append(lines[i])
|
||||
if "$$" in lines[i]:
|
||||
break
|
||||
joined = "\n".join(buf).strip()
|
||||
# Strip the leading and trailing $$ delimiters and any trailing
|
||||
# punctuation (e.g. the `,` that some equation lines end with).
|
||||
inner = joined
|
||||
if inner.startswith("$$"):
|
||||
inner = inner[2:]
|
||||
if inner.endswith("$$"):
|
||||
inner = inner[:-2]
|
||||
inner = inner.rstrip(", ")
|
||||
equation_counter[0] += 1
|
||||
try:
|
||||
add_equation_block(doc, inner, equation_counter[0])
|
||||
except Exception as exc:
|
||||
# Fallback: render as plain centered Times-Roman line so the
|
||||
# build doesn't fail on a single un-renderable equation.
|
||||
p = doc.add_paragraph()
|
||||
p.alignment = WD_ALIGN_PARAGRAPH.CENTER
|
||||
run = p.add_run(f"[equation render failed: {exc}] {inner}")
|
||||
run.font.name = "Times New Roman"
|
||||
run.font.size = Pt(10)
|
||||
run.italic = True
|
||||
i += 1
|
||||
continue
|
||||
        if re.match(r"^\d+\.\s", stripped):
            # Manual numbering: keep the number from the markdown source and
            # apply a hanging-indent paragraph format. Avoids python-docx's
            # `style='List Number'` which depends on a properly-set-up
            # numbering definition that the default Document() lacks.
            m = re.match(r"^(\d+)\.\s+(.*)$", stripped)
            num, content = m.group(1), m.group(2)
            p = doc.add_paragraph()
            p.paragraph_format.left_indent = Inches(0.4)
            p.paragraph_format.first_line_indent = Inches(-0.25)
            p.paragraph_format.space_after = Pt(4)
            content = re.sub(r"\*\*\*(.+?)\*\*\*", r"\1", content)
            content = re.sub(r"\*\*(.+?)\*\*", r"\1", content)
            content = re.sub(r"\*(.+?)\*", r"\1", content)
            content = re.sub(r"`(.+?)`", r"\1", content)
            content = latex_to_unicode(content)
            add_text_with_subsup(p, f"{num}. {content}")
            i += 1
            continue
        if stripped.startswith("- "):
            # Manual bullets with hanging indent (same rationale as numbered).
            p = doc.add_paragraph()
            p.paragraph_format.left_indent = Inches(0.4)
            p.paragraph_format.first_line_indent = Inches(-0.25)
            p.paragraph_format.space_after = Pt(4)
            content = stripped[2:]
            content = re.sub(r"\*\*\*(.+?)\*\*\*", r"\1", content)
            content = re.sub(r"\*\*(.+?)\*\*", r"\1", content)
            content = re.sub(r"\*(.+?)\*", r"\1", content)
            content = re.sub(r"`(.+?)`", r"\1", content)
            content = latex_to_unicode(content)
            add_text_with_subsup(p, f"• {content}")
            i += 1
            continue
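The number-preserving regex in the list branch can be sanity-checked on its own. `split_numbered` is an illustrative wrapper around the same pattern, not a function in the module:

```python
import re

def split_numbered(stripped):
    # Same pattern as the numbered-list branch: capture the source
    # number and the item body separately.
    m = re.match(r"^(\d+)\.\s+(.*)$", stripped)
    return (m.group(1), m.group(2)) if m else None
```

Keeping the number from the markdown source (rather than letting Word number the items) is what makes the hanging-indent workaround safe: the prefix survives even though no Word numbering definition is bound.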
        # Regular paragraph
@@ -179,14 +599,12 @@ def process_section(doc, filepath):
        para_text = re.sub(r"\*\*(.+?)\*\*", r"\1", para_text)
        para_text = re.sub(r"\*(.+?)\*", r"\1", para_text)
        para_text = re.sub(r"`(.+?)`", r"\1", para_text)
        para_text = para_text.replace("$$", "")
        para_text = para_text.replace("---", "\u2014")
        para_text = latex_to_unicode(para_text)

        p = doc.add_paragraph()
        p.paragraph_format.space_after = Pt(6)
        add_text_with_subsup(p, para_text)

        _insert_figures(doc, para_text)
@@ -234,15 +652,38 @@ def main():
    run.font.size = Pt(10)
    run.italic = True

    equation_counter = [0]
    for section_file in SECTIONS:
        filepath = PAPER_DIR / section_file
        if filepath.exists():
            process_section(doc, filepath, equation_counter=equation_counter)
        else:
            print(f"WARNING: missing section file: {filepath}")

    doc.save(str(OUTPUT))
    print(f"Saved: {OUTPUT}")
    _run_linter()


def _run_linter():
    """Run the leak linter on the freshly built DOCX. Non-fatal: prints a
    summary line. For full output run `python3 paper/lint_paper_v3.py`."""
    try:
        import lint_paper_v3  # local module
    except Exception as exc:  # pragma: no cover
        print(f"(lint skipped: {exc})")
        return
    findings = lint_paper_v3.lint_docx(OUTPUT)
    errors = sum(1 for f in findings if f.severity == "ERROR")
    warns = sum(1 for f in findings if f.severity == "WARN")
    infos = sum(1 for f in findings if f.severity == "INFO")
    if errors:
        print(f"\n[lint] {errors} ERROR finding(s) in DOCX — run "
              f"`python3 paper/lint_paper_v3.py --docx` for details.")
    elif warns or infos:
        print(f"[lint] DOCX clean of ERRORs ({warns} WARN, {infos} INFO).")
    else:
        print("[lint] DOCX clean.")


if __name__ == "__main__":