Paper A v3.20.0: partner Jimmy 2026-04-27 review + DOCX rendering overhaul
Substantive content (addresses partner Jimmy's 2026-04-27 review of v3.19.1):

Must-fix items (6/6):
- §III-F SSIM/pixel rejection rewritten from first principles (design-level argument from the luminance/contrast/structure local-window product, not the prior empirical 0.70 result)
- Table VI restructured by population × method; added missing Firm A logit-Gaussian-2 0.999 row; KDE marked undefined (unimodal), BD/McCrary marked bin-unstable (Appendix A)
- Tables IX / XI / §IV-F.3 dHash 5/8/15 inconsistency resolved: ≤8 demoted from "operational dual" to "calibration-fold-adjacent reference"; the actual classifier rule cos>0.95 AND dH≤15 = 92.46% added throughout
- New Fig. 4 (yearly per-firm best-match cosine, 5 lines, 2013-2023, Firm A on top); script 30_yearly_big4_comparison.py
- Tables XIV / XV extended with top-20% (94.8%) and top-30% (81.3%) brackets
- §III-K reframed P7.5 from "round-number lower-tail boundary" to operating point; new Table XII-B (cosine-FAR-capture tradeoff at 5 thresholds: 0.9407 / 0.945 / 0.95 / 0.977 / 0.985)

Nice-to-have items (3/3):
- Table XII expanded to 6-cut classifier sensitivity grid (0.940-0.985)
- Defensive parentheticals (84,386 vs 85,042; 30,226 vs 30,222) moved to table notes; cut "invite reviewer skepticism" and "non-load-bearing"

Codex 3-pass verification cleanup:
- Stale 0.973/0.977/0.979 references unified on the canonical 0.977 (Firm A Beta-2 forced-fit crossing from beta_mixture_results.json)
- dHash≤8 wording corrected to P95-adjacent (P95 = 9; ≤8 is the integer immediately below) instead of the misleading "rounded down"
- Table XII-B prose corrected: "non-Firm-A capture falls faster" now qualified per segment (true on the 0.95→0.977 segment but contracts on the 0.977→0.985 segment); arithmetic now from exact counts

Within-year analyses removed:
- Within-year ranking robustness check (Class A) was added in the nice-to-have pass but contradicts the v3.14 A2-removal stance; removed from §IV-G.2 and the Appendix B provenance row
- Within-CPA future-work disclosures (Class B) removed from Discussion limitation #5 and the Conclusion future-work paragraph; subsequent limitations renumbered Sixth → Fifth, Seventh → Sixth

DOCX rendering pipeline overhaul (paper/export_v3.py):

Critical fix: every v3 DOCX since v3.0 was shipping WITHOUT TABLES. strip_comments() was wholesale-deleting HTML comments, but every numerical table is wrapped in <!-- TABLE X: ... -->, so the table body was deleted alongside the wrapper. strip_comments() now unwraps TABLE comments (emitting a synthetic __TABLE_CAPTION__: marker plus the table body) while still stripping non-TABLE editorial comments. Result: 19 tables now render in the DOCX.

Other rendering fixes:
- LaTeX → Unicode conversion (50+ token replacements: Greek alphabet, ≤ ≥, × · ≈, → ↔ ⇒, etc.); \frac/\sqrt linearisation; TeX brace tricks ({=}, {,})
- Math-context-scoped sub/superscript via PUA sentinels: no more underscore-eating in identifiers like signature_analysis
- Display equations rendered via matplotlib mathtext to PNG (3 equations: cosine similarity, mixture crossing, BD/McCrary Z statistic), embedded as numbered equation blocks (1), (2), (3); content-addressed cache at paper/equations/ (gitignored, regenerable)
- Manual numbered/bulleted list rendering with hanging indent (replaces python-docx style="List Number", which silently drops the number prefix when no numbering definition is bound)
- Markdown blockquotes (> ...) defensively stripped
- Pandoc footnote ([^name]) markers no longer leak (inlined at source)
- Heading text cleaned of LaTeX residue and PUA sentinels
- File paths in body text (signature_analysis/X.py, reports/Y.json) trimmed to "(reproduction artifact in Appendix B)" pointers

New leak linter: paper/lint_paper_v3.py, a two-pass markdown-source + rendered-DOCX leak detector; auto-runs at the end of export_v3.py.

Script changes:
- 21_expanded_validation.py: added 0.9407, 0.977, 0.985 to the canonical FAR threshold list so Table XII-B is reproducible from persisted JSON
- 30_yearly_big4_comparison.py: NEW; generates Fig. 4 + per-firm yearly data (writes to reports/figures/ and reports/firm_yearly_comparison/)
- 31_within_year_ranking_robustness.py: NEW; supports the within-year robustness check (no longer cited in the paper but kept as a repo-internal due-diligence artifact)

Partner handoff DOCX shipped to ~/Downloads/Paper_A_IEEE_Access_Draft_v3.20.0_20260505.docx (536 KB: 19 tables + 4 figures + 3 equation images).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+472
-31
@@ -5,9 +5,16 @@ from docx import Document
from docx.shared import Inches, Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
from pathlib import Path
import hashlib
import re

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

PAPER_DIR = Path("/Volumes/NV2/pdf_recognize/paper")
EQUATION_CACHE_DIR = PAPER_DIR / "equations"
EQUATION_CACHE_DIR.mkdir(exist_ok=True)
FIG_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis/paper_figures")
EXTRA_FIG_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis/reports")
OUTPUT = PAPER_DIR / "Paper_A_IEEE_Access_Draft_v3.docx"
@@ -48,10 +55,10 @@ FIGURES = {
        "Fig. 3. Firm A per-signature cosine and dHash distributions against the overall CPA population.",
        3.5,
    ),
    "Fig. 4 summarises the per-firm yearly per-signature": (
        EXTRA_FIG_DIR / "figures" / "fig_yearly_big4_comparison.png",
        "Fig. 4. Per-firm yearly per-signature best-match cosine, 2013-2023. (a) Mean per-signature best-match cosine by firm bucket and fiscal year (threshold-free). (b) Share of per-signature best-match cosine ≥ 0.95 (operational cut of Section III-K). Five lines: Firm A, B, C, D, Non-Big-4. Firm A is above the other Big-4 firms in every year; Non-Big-4 is below all four Big-4 firms in every year.",
        6.5,
    ),
    "conducted an ablation study comparing three": (
        FIG_DIR / "fig4_ablation.png",
@@ -62,7 +69,321 @@ FIGURES = {


def strip_comments(text):
    """Remove HTML comments, but UNWRAP comments whose first non-blank line
    starts with `TABLE ` (or `TABLE\t`).

    The v3 markdown sources wrap every numerical table in an HTML comment of
    the form

        <!-- TABLE V: Hartigan Dip Test Results
        | Distribution | N | ... |
        |--------------|---|-----|
        | ...          | … | ... |
        -->

    The caption (`TABLE V: Hartigan Dip Test Results`) is on the same line as
    the opening `<!--`, the markdown table body is on the lines following,
    and `-->` closes the block. The previous implementation wholesale-deleted
    these comments, which silently dropped every table from the rendered
    DOCX. We now (i) detect comments whose first non-empty line starts with
    `TABLE `, (ii) emit a synthetic caption marker line
    `__TABLE_CAPTION__:<caption>` so process_section can render the caption
    as a centered bold paragraph above the table, and (iii) keep the table
    body so the existing markdown-table detector picks it up. Non-TABLE
    comments (figure placeholders, editorial notes) are stripped as before.
    """
    def _replace(match):
        body = match.group(1)
        # Find first non-blank line.
        for line in body.splitlines():
            stripped = line.strip()
            if stripped:
                first = stripped
                break
        else:
            return ""
        if not first.startswith("TABLE ") and not first.startswith("TABLE\t"):
            return ""
        # Split caption (first non-blank line) from the rest.
        lines = body.splitlines()
        # Find index of the first non-blank line and use everything after.
        for idx, line in enumerate(lines):
            if line.strip():
                caption = line.strip()
                rest = "\n".join(lines[idx + 1:])
                break
        else:
            return ""
        # Emit caption marker + body. Surround with blank lines so the
        # paragraph/table detector treats the marker as its own paragraph.
        return f"\n\n__TABLE_CAPTION__:{caption}\n{rest}\n"

    # Non-greedy match across lines.
    return re.sub(r"<!--(.*?)-->", _replace, text, flags=re.DOTALL)


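The unwrap behaviour can be exercised standalone. The sketch below is a condensed re-statement of the `_replace` logic above (same regex, compressed first-line handling), not an import of the actual script:

```python
import re

def strip_comments(text):
    # Unwrap <!-- TABLE ... --> comments; delete every other HTML comment.
    def _replace(match):
        body = match.group(1)
        first = next((l.strip() for l in body.splitlines() if l.strip()), "")
        if not first.startswith(("TABLE ", "TABLE\t")):
            return ""  # non-TABLE comment: strip as before
        lines = body.splitlines()
        idx = next(i for i, l in enumerate(lines) if l.strip())
        rest = "\n".join(lines[idx + 1:])
        return f"\n\n__TABLE_CAPTION__:{first}\n{rest}\n"
    return re.sub(r"<!--(.*?)-->", _replace, text, flags=re.DOTALL)

src = "intro\n<!-- editorial note -->\n<!-- TABLE V: Dip Test\n| a | b |\n|---|---|\n-->\n"
out = strip_comments(src)
```

The editorial comment vanishes; the TABLE comment is replaced by its caption marker plus the intact markdown table body, which the downstream table detector then picks up.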
# ---------------------------------------------------------------------------
# LaTeX → plain text + Unicode conversion
# ---------------------------------------------------------------------------
# The v3 markdown sources contain inline LaTeX ($...$) and a small number of
# display-math blocks ($$...$$). Pandoc would render these natively; the
# python-docx pipeline used here does not, so without preprocessing every
# `\leq`, `\text{dHash}_\text{indep}`, `\Delta\text{BIC}`, `60{,}448`, etc.
# leaks into the DOCX as raw LaTeX. The helpers below convert the common
# inline cases to Unicode and split subscripts/superscripts into proper Word
# runs. Display-math (rare; 3 equations in this paper) gets a best-effort
# linearisation and is acceptable for a partner-handoff DOCX; final IEEE
# typesetting is handled by the publisher's LaTeX/MathType pipeline.

LATEX_TOKEN_REPLACEMENTS = [
    # Greek letters (lower)
    (r"\\alpha(?![A-Za-z])", "α"), (r"\\beta(?![A-Za-z])", "β"), (r"\\gamma(?![A-Za-z])", "γ"),
    (r"\\delta(?![A-Za-z])", "δ"), (r"\\epsilon(?![A-Za-z])", "ε"), (r"\\zeta(?![A-Za-z])", "ζ"),
    (r"\\eta(?![A-Za-z])", "η"), (r"\\theta(?![A-Za-z])", "θ"), (r"\\iota(?![A-Za-z])", "ι"),
    (r"\\kappa(?![A-Za-z])", "κ"), (r"\\lambda(?![A-Za-z])", "λ"), (r"\\mu(?![A-Za-z])", "μ"),
    (r"\\nu(?![A-Za-z])", "ν"), (r"\\xi(?![A-Za-z])", "ξ"), (r"\\pi(?![A-Za-z])", "π"),
    (r"\\rho(?![A-Za-z])", "ρ"), (r"\\sigma(?![A-Za-z])", "σ"), (r"\\tau(?![A-Za-z])", "τ"),
    (r"\\phi(?![A-Za-z])", "φ"), (r"\\chi(?![A-Za-z])", "χ"), (r"\\psi(?![A-Za-z])", "ψ"),
    (r"\\omega(?![A-Za-z])", "ω"),
    # Greek letters (upper, only those distinguishable from Latin)
    (r"\\Gamma(?![A-Za-z])", "Γ"), (r"\\Delta(?![A-Za-z])", "Δ"), (r"\\Theta(?![A-Za-z])", "Θ"),
    (r"\\Lambda(?![A-Za-z])", "Λ"), (r"\\Xi(?![A-Za-z])", "Ξ"), (r"\\Pi(?![A-Za-z])", "Π"),
    (r"\\Sigma(?![A-Za-z])", "Σ"), (r"\\Phi(?![A-Za-z])", "Φ"), (r"\\Psi(?![A-Za-z])", "Ψ"),
    (r"\\Omega(?![A-Za-z])", "Ω"),
    # Relations / arrows
    (r"\\leq(?![A-Za-z])", "≤"), (r"\\geq(?![A-Za-z])", "≥"),
    (r"\\neq(?![A-Za-z])", "≠"), (r"\\approx(?![A-Za-z])", "≈"),
    (r"\\equiv(?![A-Za-z])", "≡"), (r"\\sim(?![A-Za-z])", "~"),
    (r"\\to(?![A-Za-z])", "→"), (r"\\rightarrow(?![A-Za-z])", "→"),
    (r"\\leftarrow(?![A-Za-z])", "←"), (r"\\Rightarrow(?![A-Za-z])", "⇒"),
    (r"\\Leftarrow(?![A-Za-z])", "⇐"),
    # Binary operators
    (r"\\times(?![A-Za-z])", "×"), (r"\\cdot(?![A-Za-z])", "·"),
    (r"\\pm(?![A-Za-z])", "±"), (r"\\mp(?![A-Za-z])", "∓"),
    (r"\\div(?![A-Za-z])", "÷"),
    # Misc
    (r"\\infty(?![A-Za-z])", "∞"), (r"\\partial(?![A-Za-z])", "∂"),
    (r"\\sum(?![A-Za-z])", "∑"), (r"\\prod(?![A-Za-z])", "∏"),
    (r"\\int(?![A-Za-z])", "∫"),
    (r"\\ldots(?![A-Za-z])", "…"), (r"\\dots(?![A-Za-z])", "…"),
    # Spacing commands (drop or replace with single space)
    (r"\\,", " "), (r"\\;", " "), (r"\\:", " "),
    (r"\\!", ""), (r"\\ ", " "),
    (r"\\quad(?![A-Za-z])", " "), (r"\\qquad(?![A-Za-z])", " "),
    # Escaped punctuation
    (r"\\%", "%"), (r"\\#", "#"), (r"\\&", "&"),
    (r"\\\$", "$"), (r"\\_", "_"),
]

def _unwrap_command(text, cmd):
    """Repeatedly replace `\\cmd{X}` → `X` until stable."""
    pat = re.compile(r"\\" + cmd + r"\{([^{}]*)\}")
    prev = None
    while prev != text:
        prev = text
        text = pat.sub(r"\1", text)
    return text


MATH_START = "\ue000"  # Private Use Area: XML-safe sentinel
MATH_END = "\ue001"  # Private Use Area: XML-safe sentinel


def latex_to_unicode(text):
    """Convert a LaTeX-laced markdown paragraph into plain text.

    Math context is preserved with private-use sentinel characters
    (MATH_START / MATH_END) so the downstream run-splitter only treats
    `_X` / `^X` as subscript / superscript inside math regions; in body
    text underscores in identifiers like `signature_analysis` survive.
    """
    if "$" not in text and "\\" not in text:
        return text

    # 1. Strip display-math delimiters first (keep the inner content for
    #    best-effort linearisation), wrapping math regions with sentinels.
    #    Then strip inline math delimiters with the same sentinel wrapping.
    text = re.sub(r"\$\$([\s\S]+?)\$\$",
                  lambda m: MATH_START + m.group(1) + MATH_END, text)
    text = re.sub(r"\$([^$]+?)\$",
                  lambda m: MATH_START + m.group(1) + MATH_END, text)

    # 2. Replace token-level commands with Unicode glyphs *before* unwrapping
    #    `\text{...}` and friends, so that `\Delta\text{BIC}` becomes
    #    `Δ\text{BIC}` (then `ΔBIC`) rather than `\DeltaBIC` which would be
    #    stripped wholesale by the cleanup pass.
    for pat, repl in LATEX_TOKEN_REPLACEMENTS:
        text = re.sub(pat, repl, text)

    # 3. Unwrap formatting / text commands (innermost first via _unwrap loop).
    for cmd in ("text", "mathbf", "mathit", "mathrm", "mathsf", "mathtt",
                "operatorname", "emph", "textbf", "textit"):
        text = _unwrap_command(text, cmd)

    # 4. \frac{a}{b} → (a)/(b); \sqrt{x} → √(x). Apply repeatedly to handle
    #    one level of nesting; deeper nesting is rare in this paper.
    for _ in range(3):
        text = re.sub(
            r"\\t?frac\{([^{}]+)\}\{([^{}]+)\}",
            r"(\1)/(\2)",
            text,
        )
        text = re.sub(r"\\sqrt\{([^{}]+)\}", r"√(\1)", text)

    # 5. TeX braces used purely for spacing/grouping: K{=}3 → K=3,
    #    60{,}448 → 60,448, 10{,}175 → 10,175.
    text = re.sub(r"\{([=<>+\-,])\}", r"\1", text)

    # 6. Strip any remaining `\cmd{...}` (best effort) and `\cmd ` tokens.
    text = re.sub(r"\\[a-zA-Z]+\{([^{}]*)\}", r"\1", text)
    text = re.sub(r"\\[a-zA-Z]+(?![A-Za-z])", "", text)

    # 7. Collapse runs of whitespace introduced by command stripping.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text


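The ordering in steps 2-5 (glyph substitution before `\text{...}` unwrapping, brace tricks last) can be spot-checked standalone. `TOKENS` and `convert` below are illustrative stand-ins for a trimmed slice of `LATEX_TOKEN_REPLACEMENTS` and the relevant passes, not the module's actual names:

```python
import re

# Trimmed copy of the replacement table: enough to show why glyph
# substitution must run before \text{...} unwrapping.
TOKENS = [(r"\\Delta(?![A-Za-z])", "Δ"),
          (r"\\leq(?![A-Za-z])", "≤"),
          (r"\\times(?![A-Za-z])", "×")]

def convert(text):
    for pat, repl in TOKENS:
        text = re.sub(pat, repl, text)              # \Delta\text{BIC} -> Δ\text{BIC}
    text = re.sub(r"\\text\{([^{}]*)\}", r"\1", text)   # Δ\text{BIC} -> ΔBIC
    text = re.sub(r"\{([=<>+\-,])\}", r"\1", text)      # 60{,}448 -> 60,448
    return text

print(convert(r"\Delta\text{BIC} \leq 60{,}448 \times 2"))
```

Running the `\text` unwrap first would yield `\DeltaBIC`, which the negative-lookahead token patterns no longer match; the cleanup pass would then strip the whole token.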
_SUBSUP_PATTERN = re.compile(
    r"_\{([^{}]*)\}"        # _{...}
    r"|\^\{([^{}]*)\}"      # ^{...}
    r"|_([A-Za-z0-9+\-])"   # _X (single token)
    r"|\^([A-Za-z0-9+\-])"  # ^X (single token)
)


def _emit_plain(paragraph, text, font_name, font_size, bold, italic):
    if not text:
        return
    run = paragraph.add_run(text)
    run.font.name = font_name
    run.font.size = font_size
    run.bold = bold
    run.italic = italic


def _emit_math(paragraph, text, font_name, font_size, bold, italic):
    """Emit `text` from a math region: split on `_X` / `_{X}` / `^X` / `^{X}`
    and render those as Word subscripts / superscripts."""
    if "_" not in text and "^" not in text:
        _emit_plain(paragraph, text, font_name, font_size, bold, italic)
        return
    pos = 0
    for m in _SUBSUP_PATTERN.finditer(text):
        if m.start() > pos:
            _emit_plain(paragraph, text[pos:m.start()],
                        font_name, font_size, bold, italic)
        sub_text = m.group(1) or m.group(3)
        sup_text = m.group(2) or m.group(4)
        if sub_text is not None:
            run = paragraph.add_run(sub_text)
            run.font.subscript = True
        else:
            run = paragraph.add_run(sup_text)
            run.font.superscript = True
        run.font.name = font_name
        run.font.size = font_size
        run.bold = bold
        run.italic = italic
        pos = m.end()
    if pos < len(text):
        _emit_plain(paragraph, text[pos:],
                    font_name, font_size, bold, italic)


def add_text_with_subsup(paragraph, text, font_name="Times New Roman",
                         font_size=Pt(10), bold=False, italic=False):
    """Add `text` to `paragraph`. Subscript/superscript handling is scoped to
    math regions delimited by MATH_START / MATH_END sentinels (set up by
    `latex_to_unicode`). Outside math regions, underscores and carets are
    preserved literally so identifiers like `signature_analysis` and
    `paper_a_results_v3.md` survive intact.
    """
    if MATH_START not in text:
        _emit_plain(paragraph, text, font_name, font_size, bold, italic)
        return

    pos = 0
    while pos < len(text):
        s = text.find(MATH_START, pos)
        if s == -1:
            _emit_plain(paragraph, text[pos:],
                        font_name, font_size, bold, italic)
            break
        if s > pos:
            _emit_plain(paragraph, text[pos:s],
                        font_name, font_size, bold, italic)
        e = text.find(MATH_END, s + 1)
        if e == -1:
            # Unterminated math region — emit rest as plain.
            _emit_plain(paragraph, text[s + 1:],
                        font_name, font_size, bold, italic)
            break
        math_body = text[s + 1:e]
        _emit_math(paragraph, math_body, font_name, font_size, bold, italic)
        pos = e + 1


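The sentinel contract can be checked without python-docx. `split_regions` below is a hypothetical helper, not part of the module: it walks a sentinel-delimited string with the same `find()`-based scan as `add_text_with_subsup`, but collects `(is_math, segment)` pairs instead of emitting Word runs (assumed single-character PUA sentinels, as above):

```python
MATH_START = "\ue000"  # assumed PUA sentinels, mirroring the module
MATH_END = "\ue001"

def split_regions(text):
    # Yield (is_math, segment) pairs by scanning for sentinel pairs.
    out, pos = [], 0
    while pos < len(text):
        s = text.find(MATH_START, pos)
        if s == -1:
            out.append((False, text[pos:]))
            break
        if s > pos:
            out.append((False, text[pos:s]))
        e = text.find(MATH_END, s + 1)
        if e == -1:
            out.append((False, text[s + 1:]))  # unterminated: treat as plain
            break
        out.append((True, text[s + 1:e]))
        pos = e + 1
    return out

txt = f"see signature_analysis and {MATH_START}x_i^2{MATH_END} here"
```

Only the middle segment is flagged as math, so `x_i^2` gets sub/superscript runs while the underscore in `signature_analysis` is left alone.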
# ---------------------------------------------------------------------------
# Display-equation rendering (matplotlib mathtext → PNG → embedded image)
# ---------------------------------------------------------------------------

# matplotlib mathtext is a subset of LaTeX. A few common TeX-only macros need
# to be substituted with mathtext-supported equivalents before parsing.
_MATHTEXT_SUBS = [
    (re.compile(r"\\tfrac\b"), r"\\frac"),  # text-frac → frac
    (re.compile(r"\\dfrac\b"), r"\\frac"),  # display-frac → frac
    (re.compile(r"\\operatorname\{([^{}]+)\}"),
     lambda m: r"\mathrm{" + m.group(1) + "}"),  # operatorname → mathrm
    (re.compile(r"\\,"), " "),  # thin space
    (re.compile(r"\\;"), " "),
    (re.compile(r"\\!"), ""),
]


def _sanitise_for_mathtext(latex: str) -> str:
    out = latex
    for pat, repl in _MATHTEXT_SUBS:
        out = pat.sub(repl, out)
    return out


def render_equation_png(latex: str, fontsize: int = 14) -> Path:
    """Render a LaTeX math expression to a tightly-cropped PNG using
    matplotlib mathtext, with content-addressed caching so a re-build only
    re-renders changed equations. Returns the cached PNG path."""
    sanitised = _sanitise_for_mathtext(latex.strip())
    digest = hashlib.sha1(
        (sanitised + f"|fs{fontsize}").encode("utf-8")).hexdigest()[:16]
    out_path = EQUATION_CACHE_DIR / f"eq_{digest}.png"
    if out_path.exists():
        return out_path
    fig = plt.figure(figsize=(8, 1.6))
    fig.text(0.5, 0.5, f"${sanitised}$",
             fontsize=fontsize, ha="center", va="center")
    fig.savefig(str(out_path), dpi=220, bbox_inches="tight",
                pad_inches=0.05)
    plt.close(fig)
    return out_path


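The cache is content-addressed: the file name is a SHA-1 digest of the sanitised source plus the font size, so an unchanged equation always resolves to the same cached PNG and a changed one gets a fresh name. `cache_name` below is an illustrative extraction of just the digest recipe (no matplotlib needed):

```python
import hashlib

def cache_name(sanitised: str, fontsize: int = 14) -> str:
    # Same recipe as render_equation_png: sha1(source + "|fs<N>")[:16].
    digest = hashlib.sha1(
        (sanitised + f"|fs{fontsize}").encode("utf-8")).hexdigest()[:16]
    return f"eq_{digest}.png"

a = cache_name(r"\frac{a}{b}")
b = cache_name(r"\frac{a}{b}")
c = cache_name(r"\frac{a}{b}", fontsize=16)
```

Folding the font size into the key means a size change also invalidates the cache entry, which is why the directory is safe to gitignore and regenerate.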
def add_equation_block(doc, latex: str, equation_number: int,
                       width_inches: float = 4.5):
    """Insert a centered display equation (rendered as PNG) followed by
    a right-aligned equation number `(N)`. Width keeps the equation
    visually proportional within the IEEE Access body column."""
    img_path = render_equation_png(latex)
    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    p.paragraph_format.space_before = Pt(6)
    p.paragraph_format.space_after = Pt(6)
    run = p.add_run()
    run.add_picture(str(img_path), width=Inches(width_inches))
    # Equation number on the same paragraph, tab-aligned to the right.
    num_run = p.add_run(f"\t({equation_number})")
    num_run.font.name = "Times New Roman"
    num_run.font.size = Pt(10)


def add_md_table(doc, table_lines):
@@ -79,14 +400,23 @@ def add_md_table(doc, table_lines):
    for r_idx, row in enumerate(rows_data):
        for c_idx in range(min(len(row), ncols)):
            cell = table.rows[r_idx].cells[c_idx]
            raw = row[c_idx]
            # Strip markdown emphasis markers; convert LaTeX before rendering.
            raw = re.sub(r"\*\*\*(.+?)\*\*\*", r"\1", raw)
            raw = re.sub(r"\*\*(.+?)\*\*", r"\1", raw)
            raw = re.sub(r"\*(.+?)\*", r"\1", raw)
            raw = re.sub(r"`(.+?)`", r"\1", raw)
            cell_text = latex_to_unicode(raw)
            # Replace the default empty paragraph with one we control.
            cell.text = ""
            cp = cell.paragraphs[0]
            cp.alignment = WD_ALIGN_PARAGRAPH.CENTER
            add_text_with_subsup(
                cp, cell_text,
                font_name="Times New Roman",
                font_size=Pt(8),
                bold=(r_idx == 0),
            )
    doc.add_paragraph()


@@ -105,10 +435,27 @@ def _insert_figures(doc, para_text):
    cr.italic = True


def process_section(doc, filepath, equation_counter=None):
    """Process one v3 markdown section. `equation_counter` is a single-element
    list (used as a mutable counter shared across sections) tracking the
    running display-equation number."""
    if equation_counter is None:
        equation_counter = [0]
    text = filepath.read_text(encoding="utf-8")
    text = strip_comments(text)
    lines = text.split("\n")
    # Defensive blockquote handling: markdown blockquote lines (`> body`) are
    # not rendered as Word callout blocks here, but stripping the leading
    # `> ` keeps the body text from leaking the literal `>` and the empty
    # `>` separator lines into the DOCX.
    cleaned = []
    for ln in lines:
        s = ln.lstrip()
        if s == ">" or s.startswith("> "):
            cleaned.append(s[1:].lstrip())
        else:
            cleaned.append(ln)
    lines = cleaned
    i = 0
    while i < len(lines):
        line = lines[i]
@@ -117,23 +464,44 @@ def process_section(doc, filepath):
            i += 1
            continue
        if stripped.startswith("# "):
            h = doc.add_heading(
                latex_to_unicode(stripped[2:]).replace(MATH_START, "").replace(MATH_END, ""),
                level=1)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        if stripped.startswith("## "):
            h = doc.add_heading(
                latex_to_unicode(stripped[3:]).replace(MATH_START, "").replace(MATH_END, ""),
                level=2)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        if stripped.startswith("### "):
            h = doc.add_heading(
                latex_to_unicode(stripped[4:]).replace(MATH_START, "").replace(MATH_END, ""),
                level=3)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        if stripped.startswith("__TABLE_CAPTION__:"):
            caption_text = stripped[len("__TABLE_CAPTION__:"):].strip()
            caption_text = latex_to_unicode(caption_text)
            cp = doc.add_paragraph()
            cp.alignment = WD_ALIGN_PARAGRAPH.CENTER
            cp.paragraph_format.space_before = Pt(6)
            cp.paragraph_format.space_after = Pt(2)
            add_text_with_subsup(
                cp, caption_text,
                font_name="Times New Roman",
                font_size=Pt(9),
                bold=True,
            )
            i += 1
            continue
if "|" in stripped and i + 1 < len(lines) and re.match(r"\s*\|[-|: ]+\|", lines[i + 1]):
|
||||
table_lines = []
|
||||
while i < len(lines) and "|" in lines[i]:
|
||||
@@ -141,22 +509,74 @@ def process_section(doc, filepath):
|
||||
i += 1
|
||||
add_md_table(doc, table_lines)
|
||||
continue
|
||||
# Display math: a line starting with `$$` is treated as a single-line
|
||||
# equation block and rendered as an embedded mathtext PNG with an
|
||||
# auto-incrementing equation number.
|
||||
if stripped.startswith("$$"):
|
||||
# Accumulate until a closing $$ is found (single line in our
|
||||
# corpus, but defensively support multi-line just in case).
|
||||
buf = [stripped]
|
||||
if not (stripped.count("$$") >= 2 and stripped.endswith("$$")):
|
||||
while i + 1 < len(lines):
|
||||
i += 1
|
||||
buf.append(lines[i])
|
||||
if "$$" in lines[i]:
|
||||
break
|
||||
joined = "\n".join(buf).strip()
|
||||
# Strip the leading and trailing $$ delimiters and any trailing
|
||||
# punctuation (e.g. the `,` that some equation lines end with).
|
||||
inner = joined
|
||||
if inner.startswith("$$"):
|
||||
inner = inner[2:]
|
||||
if inner.endswith("$$"):
|
||||
inner = inner[:-2]
|
||||
inner = inner.rstrip(", ")
|
||||
equation_counter[0] += 1
|
||||
try:
|
||||
add_equation_block(doc, inner, equation_counter[0])
|
||||
except Exception as exc:
|
||||
# Fallback: render as plain centered Times-Roman line so the
|
||||
# build doesn't fail on a single un-renderable equation.
|
||||
p = doc.add_paragraph()
|
||||
p.alignment = WD_ALIGN_PARAGRAPH.CENTER
|
||||
run = p.add_run(f"[equation render failed: {exc}] {inner}")
|
||||
run.font.name = "Times New Roman"
|
||||
run.font.size = Pt(10)
|
||||
run.italic = True
|
||||
i += 1
|
||||
continue
|
||||
        if re.match(r"^\d+\.\s", stripped):
            # Manual numbering: keep the number from the markdown source and
            # apply a hanging-indent paragraph format. Avoids python-docx's
            # `style='List Number'` which depends on a properly-set-up
            # numbering definition that the default Document() lacks.
            m = re.match(r"^(\d+)\.\s+(.*)$", stripped)
            num, content = m.group(1), m.group(2)
            p = doc.add_paragraph()
            p.paragraph_format.left_indent = Inches(0.4)
            p.paragraph_format.first_line_indent = Inches(-0.25)
            p.paragraph_format.space_after = Pt(4)
            content = re.sub(r"\*\*\*(.+?)\*\*\*", r"\1", content)
            content = re.sub(r"\*\*(.+?)\*\*", r"\1", content)
            content = re.sub(r"\*(.+?)\*", r"\1", content)
            content = re.sub(r"`(.+?)`", r"\1", content)
            content = latex_to_unicode(content)
            add_text_with_subsup(p, f"{num}. {content}")
            i += 1
            continue
        if stripped.startswith("- "):
            # Manual bullets with hanging indent (same rationale as numbered).
            p = doc.add_paragraph()
            p.paragraph_format.left_indent = Inches(0.4)
            p.paragraph_format.first_line_indent = Inches(-0.25)
            p.paragraph_format.space_after = Pt(4)
            content = stripped[2:]
            content = re.sub(r"\*\*\*(.+?)\*\*\*", r"\1", content)
            content = re.sub(r"\*\*(.+?)\*\*", r"\1", content)
            content = re.sub(r"\*(.+?)\*", r"\1", content)
            content = re.sub(r"`(.+?)`", r"\1", content)
            content = latex_to_unicode(content)
            add_text_with_subsup(p, f"• {content}")
            i += 1
            continue
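The number-preserving regex in the list branch can be sanity-checked on its own. `split_numbered` is an illustrative wrapper around the same pattern, not a function in the module:

```python
import re

def split_numbered(stripped):
    # Same pattern as the numbered-list branch: capture the source
    # number and the item body separately.
    m = re.match(r"^(\d+)\.\s+(.*)$", stripped)
    return (m.group(1), m.group(2)) if m else None
```

Keeping the number from the markdown source (rather than letting Word number the items) is what makes the hanging-indent workaround safe: the prefix survives even though no Word numbering definition is bound.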
        # Regular paragraph
@@ -179,14 +599,12 @@ def process_section(doc, filepath):
        para_text = re.sub(r"\*\*(.+?)\*\*", r"\1", para_text)
        para_text = re.sub(r"\*(.+?)\*", r"\1", para_text)
        para_text = re.sub(r"`(.+?)`", r"\1", para_text)
        para_text = para_text.replace("$$", "")
        para_text = para_text.replace("---", "\u2014")
        para_text = latex_to_unicode(para_text)

        p = doc.add_paragraph()
        p.paragraph_format.space_after = Pt(6)
        add_text_with_subsup(p, para_text)

        _insert_figures(doc, para_text)
@@ -234,15 +652,38 @@ def main():
    run.font.size = Pt(10)
    run.italic = True

    equation_counter = [0]
    for section_file in SECTIONS:
        filepath = PAPER_DIR / section_file
        if filepath.exists():
            process_section(doc, filepath, equation_counter=equation_counter)
        else:
            print(f"WARNING: missing section file: {filepath}")

    doc.save(str(OUTPUT))
    print(f"Saved: {OUTPUT}")
    _run_linter()


def _run_linter():
    """Run the leak linter on the freshly built DOCX. Non-fatal: prints a
    summary line. For full output run `python3 paper/lint_paper_v3.py`."""
    try:
        import lint_paper_v3  # local module
    except Exception as exc:  # pragma: no cover
        print(f"(lint skipped: {exc})")
        return
    findings = lint_paper_v3.lint_docx(OUTPUT)
    errors = sum(1 for f in findings if f.severity == "ERROR")
    warns = sum(1 for f in findings if f.severity == "WARN")
    infos = sum(1 for f in findings if f.severity == "INFO")
    if errors:
        print(f"\n[lint] {errors} ERROR finding(s) in DOCX — run "
              f"`python3 paper/lint_paper_v3.py --docx` for details.")
    elif warns or infos:
        print(f"[lint] DOCX clean of ERRORs ({warns} WARN, {infos} INFO).")
    else:
        print("[lint] DOCX clean.")


if __name__ == "__main__":