Paper A v3.20.0: partner Jimmy 2026-04-27 review + DOCX rendering overhaul

Substantive content (addresses partner Jimmy's 2026-04-27 review of v3.19.1):

Must-fix items (6/6):
- §III-F SSIM/pixel rejection rewritten from first principles (design-level
  argument from luminance/contrast/structure local-window product, not the
  prior empirical 0.70 result)
- Table VI restructured by population × method; added missing Firm A
  logit-Gaussian-2 0.999 row; KDE marked undefined (unimodal), BD/McCrary
  marked bin-unstable (Appendix A)
- Tables IX / XI / §IV-F.3 dHash 5/8/15 inconsistency resolved: ≤8 demoted
  from "operational dual" to "calibration-fold-adjacent reference"; the
  actual classifier rule cos>0.95 AND dH≤15 = 92.46% added throughout
- New Fig. 4 (yearly per-firm best-match cosine, 5 lines, 2013-2023, Firm A
  on top); script 30_yearly_big4_comparison.py
- Tables XIV / XV extended with top-20% (94.8%) and top-30% (81.3%) brackets
- §III-K reframed P7.5 from "round-number lower-tail boundary" to operating
  point; new Table XII-B (cosine-FAR-capture tradeoff at 5 thresholds:
  0.9407 / 0.945 / 0.95 / 0.977 / 0.985)

Nice-to-have items (3/3):
- Table XII expanded to 6-cut classifier sensitivity grid (0.940-0.985)
- Defensive parentheticals (84,386 vs 85,042; 30,226 vs 30,222) moved to
  table notes; cut "invite reviewer skepticism" and "non-load-bearing"

Codex 3-pass verification cleanup:
- Stale 0.973/0.977/0.979 references unified on canonical 0.977 (Firm A
  Beta-2 forced-fit crossing from beta_mixture_results.json)
- dHash≤8 wording corrected to P95-adjacent (P95 = 9, ≤8 is the integer
  immediately below) instead of misleading "rounded down"
- Table XII-B prose corrected: per-segment qualification of "non-Firm-A
  capture falls faster" (true on 0.95→0.977 segment but contracts on
  0.977→0.985 segment); arithmetic now from exact counts

Within-year analyses removed:
- Within-year ranking robustness check (Class A) was added in nice-to-have
  pass but contradicts v3.14 A2-removal stance; removed from §IV-G.2 + the
  Appendix B provenance row
- Within-CPA future-work disclosures (Class B) removed from Discussion
  limitation #5 and Conclusion future-work paragraph; subsequent limitations
  renumbered Sixth → Fifth, Seventh → Sixth

DOCX rendering pipeline overhaul (paper/export_v3.py):

Critical fix - every v3 DOCX since v3.0 was shipping WITHOUT TABLES:
strip_comments() was wholesale-deleting HTML comments, but every numerical
table is wrapped in <!-- TABLE X: ... -->, so the table body was deleted
alongside the wrapper. Now unwraps TABLE comments (emit synthetic
__TABLE_CAPTION__: marker + table body) while still stripping non-TABLE
editorial comments. Result: 19 tables now render in the DOCX.
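A minimal sketch of that unwrap step (hypothetical regexes and naming; the actual implementation in export_v3.py may differ):

```python
import re

# Hypothetical sketch of the fix described above: unwrap <!-- TABLE ... -->
# comments into a synthetic __TABLE_CAPTION__: marker plus the preserved
# table body, then strip remaining (editorial) HTML comments wholesale.
TABLE_COMMENT = re.compile(r"<!--\s*(TABLE[^\n]*)\n(.*?)-->", re.DOTALL)

def strip_comments(text: str) -> str:
    text = TABLE_COMMENT.sub(
        lambda m: f"__TABLE_CAPTION__: {m.group(1).strip()}\n{m.group(2)}",
        text,
    )
    # Non-TABLE comments are still deleted outright, as before.
    return re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
```

The pre-fix behaviour was the second `re.sub` alone, which is why the table bodies vanished along with their wrappers.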

Other rendering fixes:
- LaTeX → Unicode conversion (50+ token replacements: Greek alphabet, ≤≥,
  ×·≈, →↔⇒, etc.); \frac/\sqrt linearisation; TeX brace tricks ({=}, {,})
- Math-context-scoped sub/superscript via PUA sentinels (U+E000 / U+E001):
  no more underscore-eating in identifiers like signature_analysis
- Display equations rendered via matplotlib mathtext to PNG (3 equations:
  cosine sim, mixture crossing, BD/McCrary Z statistic), embedded as
  numbered equation blocks (1), (2), (3); content-addressed cache at
  paper/equations/ (gitignored, regenerable)
- Manual numbered/bulleted list rendering with hanging indent (replaces
  python-docx style="List Number" which silently drops the number prefix
  when no numbering definition is bound)
- Markdown blockquote (> ...) defensively stripped
- Pandoc footnote ([^name]) markers no longer leak (inlined at source)
- Heading text cleaned of LaTeX residue + PUA sentinels
- File paths in body text (signature_analysis/X.py, reports/Y.json)
  trimmed to "(reproduction artifact in Appendix B)" pointers
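The token-replacement pass can be sketched roughly as follows (illustrative subset only; the actual map and regexes in export_v3.py are more extensive):

```python
import re

# Illustrative subset of the LaTeX -> Unicode token map described above;
# the real map in export_v3.py has 50+ entries (full Greek alphabet, etc.).
LATEX_TOKENS = {
    r"\leq": "≤", r"\geq": "≥", r"\times": "×", r"\cdot": "·",
    r"\approx": "≈", r"\rightarrow": "→", r"\Rightarrow": "⇒",
    r"\alpha": "α", r"\beta": "β", r"\sigma": "σ",
}
# Longest-first alternation so longer commands win over shorter prefixes.
_TOKEN_RE = re.compile(
    "|".join(re.escape(t) for t in sorted(LATEX_TOKENS, key=len, reverse=True))
)

def latex_to_unicode(text: str) -> str:
    text = _TOKEN_RE.sub(lambda m: LATEX_TOKENS[m.group(0)], text)
    # Linearise simple (non-nested) \frac{a}{b} as a/b.
    text = re.sub(r"\\frac\{([^{}]*)\}\{([^{}]*)\}", r"\1/\2", text)
    # Drop TeX brace tricks like {=} and {,}.
    text = re.sub(r"\{([,=<>+\-])\}", r"\1", text)
    return text
```

Nested `\frac` arguments fall outside this sketch, which is exactly why the linter below flags them as WARN rather than trusting the linearisation.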

New leak linter: paper/lint_paper_v3.py - two-pass markdown source +
rendered DOCX leak detector; auto-runs at end of export_v3.py.

Script changes:
- 21_expanded_validation.py: added 0.9407, 0.977, 0.985 to canonical FAR
  threshold list so Table XII-B is reproducible from persisted JSON
- 30_yearly_big4_comparison.py: NEW; generates Fig. 4 + per-firm yearly
  data (writes to reports/figures/ and reports/firm_yearly_comparison/)
- 31_within_year_ranking_robustness.py: NEW; supports the within-year
  robustness check (no longer cited in paper but kept as repo-internal
  due-diligence artifact)

Partner handoff DOCX shipped to
~/Downloads/Paper_A_IEEE_Access_Draft_v3.20.0_20260505.docx (536 KB:
19 tables + 4 figures + 3 equation images).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 13:44:49 +08:00
parent 623eb4cd4b
commit 53125d11d9
13 changed files with 1554 additions and 112 deletions
#!/usr/bin/env python3
"""Paper A v3 markdown / DOCX leak linter.

Runs two passes:

Source pass — scans the v3 markdown sources for syntax patterns that the
python-docx export pipeline does NOT render natively. Each finding is a
file:line:severity:message tuple. Severity is ERROR (will leak literal
syntax into Word), WARN (sometimes leaks), or INFO (style nits).

DOCX pass — opens the rendered DOCX and scans every paragraph and table
cell for known leak signatures. This is the authoritative check: even
if the source pass is clean, the DOCX pass tells you what your partner
will actually see. The DOCX pass currently checks for:

- leftover LaTeX commands (`\\cmd`)
- unstripped `$` math delimiters
- pandoc footnote markers (`[^name]`)
- markdown blockquote markers (lines starting with `> `)
- TeX brace tricks (`{=}`, `{,}`)
- PUA sentinels (`\\uE000`, `\\uE001`) leaking from the math-region
  run-splitter
- the synthetic table-caption marker `__TABLE_CAPTION__:` if it ever
  survives processing

Exit code:
  0  clean
  1  WARN-level findings only (ship-able after review)
  2  ERROR-level findings (do NOT ship)

Usage:
  python3 paper/lint_paper_v3.py           # both passes
  python3 paper/lint_paper_v3.py --source  # source-side only
  python3 paper/lint_paper_v3.py --docx    # DOCX-side only

Designed to be run after `python3 export_v3.py` and before copying the
DOCX to ~/Downloads.
"""
from __future__ import annotations

import argparse
import re
import sys
from dataclasses import dataclass
from pathlib import Path

PAPER_DIR = Path(__file__).resolve().parent
DOCX_PATH = PAPER_DIR / "Paper_A_IEEE_Access_Draft_v3.docx"

V3_SOURCES = [
    "paper_a_abstract_v3.md",
    "paper_a_introduction_v3.md",
    "paper_a_related_work_v3.md",
    "paper_a_methodology_v3.md",
    "paper_a_results_v3.md",
    "paper_a_discussion_v3.md",
    "paper_a_conclusion_v3.md",
    "paper_a_appendix_v3.md",
    "paper_a_declarations_v3.md",
    "paper_a_references_v3.md",
]

# ---------------------------------------------------------------------------
# Finding model + ANSI colour helpers
# ---------------------------------------------------------------------------

SEVERITY_RANK = {"ERROR": 2, "WARN": 1, "INFO": 0}

COLOR = {
    "ERROR": "\033[31m",  # red
    "WARN": "\033[33m",   # yellow
    "INFO": "\033[36m",   # cyan
    "RESET": "\033[0m",
    "BOLD": "\033[1m",
}


@dataclass
class Finding:
    severity: str
    rule: str
    location: str  # "file:line" or "DOCX:para 42" / "DOCX:table 6 row 3 col 2"
    message: str
    snippet: str = ""

    def render(self, use_color: bool = True) -> str:
        col = COLOR[self.severity] if use_color else ""
        rst = COLOR["RESET"] if use_color else ""
        bold = COLOR["BOLD"] if use_color else ""
        head = f"{col}[{self.severity}]{rst} {bold}{self.rule}{rst} @ {self.location}"
        body = f"\n  {self.message}"
        snip = f"\n  > {self.snippet}" if self.snippet else ""
        return head + body + snip


# ---------------------------------------------------------------------------
# Source-side rules
# ---------------------------------------------------------------------------
# Each rule: (pattern, severity, rule_id, message, predicate)
# predicate(match, line, in_comment, in_table) → bool: returns True to keep
# the finding (lets us suppress matches that are inside HTML comments or
# fenced table rows).


def _outside_table_comment(match: re.Match, line: str, in_comment: bool, in_table: bool) -> bool:
    """Suppress findings inside HTML comments (where they're allowed) or
    inside markdown table rows (where they survive intact via add_md_table)."""
    return not in_comment and not in_table


def _always(match: re.Match, line: str, in_comment: bool, in_table: bool) -> bool:
    return True


SOURCE_RULES = [
    # Pandoc footnote markers — leak as raw text in the DOCX.
    (re.compile(r"\[\^[A-Za-z0-9_-]+\]"),
     "ERROR", "pandoc-footnote",
     "Pandoc-style footnote `[^name]` does not render in DOCX. "
     "Inline the explanation as a parenthetical instead.",
     _outside_table_comment),
    # Markdown blockquote `> body` lines — exporter strips them defensively
    # now, but flag for awareness so authors don't rely on them rendering.
    (re.compile(r"^>\s"),
     "WARN", "blockquote",
     "Markdown blockquote `> ...` is stripped to plain paragraph in DOCX "
     "(no quote-block formatting). If you intended a callout, use bold "
     "lead-in instead.",
     _always),
    # Display-math fences `$$...$$` (only when the line itself starts with
    # `$$`) — exporter does best-effort linearisation, but the result is
    # ugly. Inline the equation as plain prose where possible.
    (re.compile(r"^\$\$.+?\$\$\s*$|^\$\$\s*$"),
     "WARN", "display-math",
     "Display math `$$...$$` renders as a best-effort plain-text "
     "linearisation in DOCX (no MathType/equation rendering). Consider "
     "replacing with a numbered equation image or inline prose.",
     _always),
    # Inline math containing `\frac{...{...}...}` — nested braces in a
    # frac argument are not handled by the exporter's regex.
    (re.compile(r"\\t?frac\{[^{}]*\{[^{}]*\}[^{}]*\}\{|\\t?frac\{[^{}]+\}\{[^{}]*\{"),
     "WARN", "nested-frac",
     "Nested-brace `\\frac{...}{...}` may not linearise cleanly. Verify "
     "the rendered DOCX paragraph or rewrite the math inline.",
     _outside_table_comment),
    # Setext-style headers (=== / ---) under a line of text — not handled.
    (re.compile(r"^=+\s*$|^-{3,}\s*$"),
     "INFO", "setext-header",
     "Setext-style header (=== / ---) is not handled by the exporter; "
     "use ATX (#, ##, ###) instead.",
     _always),
    # Pandoc fenced div `:::` — not handled.
    (re.compile(r"^:::"),
     "ERROR", "pandoc-fenced-div",
     "Pandoc fenced div `:::` is not handled by the exporter and would "
     "leak into the DOCX as plain text.",
     _always),
    # Pandoc bracketed-attribute spans `[text]{.class}` — not handled.
    (re.compile(r"\][\{][^}]*[\}]"),
     "WARN", "pandoc-attribute-span",
     "Pandoc attribute span `[text]{.class}` is not parsed by the exporter "
     "and the brace block will leak.",
     _outside_table_comment),
    # File paths in body text — Appendix B is the canonical home for
    # script→artifact references.
    (re.compile(r"`signature_analysis/\d+_[a-z_]+\.py`"),
     "INFO", "script-path-in-body",
     "Verbose script path in body text. Consider replacing with "
     "'(reproduction artifact in Appendix B)' for body-prose tightness.",
     _outside_table_comment),
    # `reports/...json` paths in body text — same rationale.
    (re.compile(r"`reports/[a-z_]+/[a-z_]+\.(?:json|md)`"),
     "INFO", "report-path-in-body",
     "Verbose report-artifact path in body text. Consider replacing with "
     "'(see Appendix B provenance map)'.",
     _outside_table_comment),
    # Bare HTML comments that are NOT TABLE/FIGURE markers may indicate
    # editorial residue. Stripped wholesale by exporter, so harmless, but
    # worth visibility.
    (re.compile(r"^<!--\s*$|^<!-- (?!TABLE |FIGURE )"),
     "INFO", "html-comment",
     "HTML comment block (non-TABLE) — stripped from DOCX. Keep for "
     "editorial notes or remove for tidiness.",
     _always),
]


def lint_sources() -> list[Finding]:
    findings: list[Finding] = []
    for src in V3_SOURCES:
        path = PAPER_DIR / src
        if not path.exists():
            continue
        in_comment = False
        in_table = False
        for line_no, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
            # Track HTML-comment context (multi-line aware).
            if "<!--" in line:
                in_comment = True
            stripped = line.strip()
            if stripped.startswith("|") and stripped.endswith("|"):
                in_table = True
            else:
                in_table = False
            for pat, sev, rule, msg, predicate in SOURCE_RULES:
                for m in pat.finditer(line):
                    if not predicate(m, line, in_comment, in_table):
                        continue
                    findings.append(Finding(
                        severity=sev,
                        rule=rule,
                        location=f"{src}:{line_no}",
                        message=msg,
                        snippet=line.rstrip()[:120],
                    ))
            if "-->" in line:
                in_comment = False
    return findings


# ---------------------------------------------------------------------------
# DOCX-side rules
# ---------------------------------------------------------------------------

DOCX_LEAK_PATTERNS = [
    # (pattern, severity, rule_id, message)
    (re.compile(r"\\[a-zA-Z]+(?:\{[^{}]*\})?"),
     "ERROR", "leftover-latex-cmd",
     "LaTeX command `\\cmd` leaked into DOCX. Either add a token rule to "
     "`latex_to_unicode` in `export_v3.py` or rewrite the source as plain text."),
    (re.compile(r"(?<!\\)\$[^$\s][^$]*\$"),
     "ERROR", "unstripped-dollar-math",
     "Inline math `$...$` was not stripped. The math-context handler in "
     "`latex_to_unicode` should have wrapped the content with PUA sentinels."),
    (re.compile(r"\[\^[A-Za-z0-9_-]+\]"),
     "ERROR", "pandoc-footnote-leak",
     "Pandoc footnote marker leaked into DOCX. Inline the footnote body "
     "as a parenthetical at the source."),
    (re.compile(r"^>\s"),
     "ERROR", "blockquote-leak",
     "Markdown blockquote `> ...` leaked literal `>` into DOCX. The "
     "exporter pre-pass should strip these — check `process_section`."),
    (re.compile(r"\{[,=<>+\-]\}"),
     "ERROR", "tex-brace-trick",
     "TeX brace-trick `{=}` / `{,}` leaked. Should be stripped by "
     "`latex_to_unicode`."),
    (re.compile("[\ue000\ue001]"),  # the two PUA sentinel code points
     "ERROR", "pua-sentinel-leak",
     "Math-region PUA sentinel (\\uE000 / \\uE001) leaked. A render path "
     "is bypassing `add_text_with_subsup`; check headings / list items / "
     "title-page paragraphs."),
    (re.compile(r"__TABLE_CAPTION__"),
     "ERROR", "table-caption-marker-leak",
     "Synthetic `__TABLE_CAPTION__:` marker leaked. The marker is meant "
     "to be consumed by `process_section` and rendered as a centered "
     "bold caption paragraph."),
    (re.compile(r"signature[a-z]*analysis/\d+[a-z_]+\.py"),
     "ERROR", "underscore-eaten-path",
     "Underscores eaten from a script path (e.g., "
     "`signatureanalysis/28byteidentitydecomposition.py`). The "
     "math-context-scoped subscript handler in `add_text_with_subsup` "
     "should leave underscores intact in plain text."),
    (re.compile(r"\b(\w+_\w+)+\b", flags=re.UNICODE),
     "INFO", "underscore-identifier",
     "Underscored identifier in body text (e.g., a code symbol or path). "
     "Verify it renders with underscores intact, not as subscripts."),
]


def lint_docx(docx_path: Path = DOCX_PATH) -> list[Finding]:
    try:
        from docx import Document
    except ImportError:
        return [Finding("ERROR", "missing-dep",
                        "lint:docx",
                        "python-docx is not installed; cannot run DOCX pass.")]
    if not docx_path.exists():
        return [Finding("ERROR", "missing-docx",
                        str(docx_path),
                        "Built DOCX not found. Run `python3 export_v3.py` first.")]
    doc = Document(str(docx_path))
    findings: list[Finding] = []
    seen_signatures = set()  # dedupe identical leaks across paragraphs

    def scan(text: str, location: str):
        for pat, sev, rule, msg in DOCX_LEAK_PATTERNS:
            for m in pat.finditer(text):
                # Skip the INFO-level identifier rule unless it looks like
                # an obvious math residue (e.g., `N_a` or `cos_sim`).
                if rule == "underscore-identifier":
                    sample = m.group(0)
                    # Only complain about identifiers that look like math
                    # residue: short, underscore-separated tokens.
                    parts = sample.split("_")
                    if not all(len(p) <= 4 for p in parts):
                        continue
                    if not all(p.isalnum() and not p.isdigit() for p in parts):
                        continue
                key = (rule, m.group(0))
                if key in seen_signatures:
                    continue
                seen_signatures.add(key)
                findings.append(Finding(
                    severity=sev,
                    rule=rule,
                    location=location,
                    message=msg,
                    snippet=text[max(0, m.start() - 30):m.end() + 30].replace("\n", " ")[:140],
                ))

    for i, p in enumerate(doc.paragraphs):
        if p.text:
            scan(p.text, f"DOCX:para {i}")
    for ti, t in enumerate(doc.tables):
        for ri, row in enumerate(t.rows):
            for ci, cell in enumerate(row.cells):
                if cell.text:
                    scan(cell.text, f"DOCX:table {ti + 1} row {ri} col {ci}")
    return findings


# ---------------------------------------------------------------------------
# Reporter
# ---------------------------------------------------------------------------

def summarise(findings: list[Finding], use_color: bool = True) -> int:
    def c(key: str) -> str:
        return COLOR[key] if use_color else ""

    if not findings:
        print(f"{c('BOLD')}{c('INFO')}clean — no leaks detected{c('RESET')}")
        return 0
    counts = {"ERROR": 0, "WARN": 0, "INFO": 0}
    findings.sort(key=lambda f: (-SEVERITY_RANK[f.severity], f.location))
    for f in findings:
        counts[f.severity] += 1
        print(f.render(use_color))
        print()
    print(f"{c('BOLD')}summary{c('RESET')}: "
          f"{c('ERROR')}{counts['ERROR']} ERROR{c('RESET')} "
          f"{c('WARN')}{counts['WARN']} WARN{c('RESET')} "
          f"{c('INFO')}{counts['INFO']} INFO{c('RESET')}")
    if counts["ERROR"]:
        return 2
    if counts["WARN"]:
        return 1
    return 0


def main():
    ap = argparse.ArgumentParser(
        description="Lint Paper A v3 markdown sources and rendered DOCX for "
                    "syntax-leak issues.",
    )
    ap.add_argument("--source", action="store_true",
                    help="run only the markdown source pass")
    ap.add_argument("--docx", action="store_true",
                    help="run only the rendered DOCX pass")
    ap.add_argument("--no-color", action="store_true",
                    help="disable ANSI colour output")
    args = ap.parse_args()
    use_color = sys.stdout.isatty() and not args.no_color
    findings: list[Finding] = []
    if args.source or not (args.source or args.docx):
        print(f"{COLOR['BOLD'] if use_color else ''}--- source pass "
              f"({len(V3_SOURCES)} files) ---{COLOR['RESET'] if use_color else ''}")
        findings.extend(lint_sources())
    if args.docx or not (args.source or args.docx):
        print(f"{COLOR['BOLD'] if use_color else ''}\n--- docx pass "
              f"({DOCX_PATH.name}) ---{COLOR['RESET'] if use_color else ''}")
        findings.extend(lint_docx())
    print()
    sys.exit(summarise(findings, use_color))


if __name__ == "__main__":
    main()