53125d11d9
Substantive content (addresses partner Jimmy's 2026-04-27 review of v3.19.1):

Must-fix items (6/6):
- §III-F SSIM/pixel rejection rewritten from first principles (design-level argument from the luminance/contrast/structure local-window product, not the prior empirical 0.70 result)
- Table VI restructured by population × method; added missing Firm A logit-Gaussian-2 0.999 row; KDE marked undefined (unimodal), BD/McCrary marked bin-unstable (Appendix A)
- Tables IX / XI / §IV-F.3 dHash 5/8/15 inconsistency resolved: ≤8 demoted from "operational dual" to "calibration-fold-adjacent reference"; the actual classifier rule (cos > 0.95 AND dH ≤ 15 = 92.46%) added throughout
- New Fig. 4 (yearly per-firm best-match cosine, 5 lines, 2013-2023, Firm A on top); script 30_yearly_big4_comparison.py
- Tables XIV / XV extended with top-20% (94.8%) and top-30% (81.3%) brackets
- §III-K reframed P7.5 from "round-number lower-tail boundary" to operating point; new Table XII-B (cosine-FAR-capture tradeoff at 5 thresholds: 0.9407 / 0.945 / 0.95 / 0.977 / 0.985)

Nice-to-have items (3/3):
- Table XII expanded to 6-cut classifier sensitivity grid (0.940-0.985)
- Defensive parentheticals (84,386 vs 85,042; 30,226 vs 30,222) moved to table notes; cut "invite reviewer skepticism" and "non-load-bearing"

Codex 3-pass verification cleanup:
- Stale 0.973/0.977/0.979 references unified on the canonical 0.977 (Firm A Beta-2 forced-fit crossing from beta_mixture_results.json)
- dHash ≤ 8 wording corrected to P95-adjacent (P95 = 9; ≤8 is the integer immediately below) instead of the misleading "rounded down"
- Table XII-B prose corrected: "non-Firm-A capture falls faster" now qualified per segment (true on the 0.95→0.977 segment but contracts on the 0.977→0.985 segment); arithmetic now derived from exact counts

Within-year analyses removed:
- Within-year ranking robustness check (Class A) was added in the nice-to-have pass but contradicts the v3.14 A2-removal stance; removed from §IV-G.2 and the Appendix B provenance row
- Within-CPA future-work disclosures (Class B) removed from Discussion limitation #5 and the Conclusion future-work paragraph; subsequent limitations renumbered Sixth → Fifth, Seventh → Sixth

DOCX rendering pipeline overhaul (paper/export_v3.py):

Critical fix: every v3 DOCX since v3.0 was shipping WITHOUT TABLES. strip_comments() was wholesale-deleting HTML comments, but every numerical table is wrapped in <!-- TABLE X: ... -->, so the table body was deleted alongside the wrapper. The exporter now unwraps TABLE comments (emitting a synthetic __TABLE_CAPTION__: marker plus the table body) while still stripping non-TABLE editorial comments. Result: 19 tables now render in the DOCX.

Other rendering fixes:
- LaTeX → Unicode conversion (50+ token replacements: Greek alphabet, ≤ ≥, × · ≈, → ↔ ⇒, etc.); \frac/\sqrt linearisation; TeX brace tricks ({=}, {,})
- Math-context-scoped sub/superscripts via PUA sentinels (\uE000/\uE001): no more underscore-eating in identifiers like signature_analysis
- Display equations rendered via matplotlib mathtext to PNG (3 equations: cosine similarity, mixture crossing, BD/McCrary Z statistic), embedded as numbered equation blocks (1), (2), (3); content-addressed cache at paper/equations/ (gitignored, regenerable)
- Manual numbered/bulleted list rendering with hanging indent (replaces python-docx style="List Number", which silently drops the number prefix when no numbering definition is bound)
- Markdown blockquotes (> ...) defensively stripped
- Pandoc footnote markers ([^name]) no longer leak (inlined at source)
- Heading text cleaned of LaTeX residue and PUA sentinels
- File paths in body text (signature_analysis/X.py, reports/Y.json) trimmed to "(reproduction artifact in Appendix B)" pointers

New leak linter: paper/lint_paper_v3.py - two-pass markdown source + rendered DOCX leak detector; auto-runs at the end of export_v3.py.
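The TABLE-comment unwrapping described above can be sketched roughly as follows. This is an illustrative minimal version only: the real export_v3.py implementation, its exact wrapper format, and the helper name strip_comments are assumptions here.

```python
import re

# Hypothetical sketch, not the shipped export_v3.py code. TABLE comments are
# unwrapped (caption re-emitted as a synthetic __TABLE_CAPTION__: marker,
# table body kept); all other HTML comments are stripped wholesale.
TABLE_COMMENT = re.compile(r"<!--\s*(TABLE[^\n]*)\n(.*?)-->", flags=re.DOTALL)
OTHER_COMMENT = re.compile(r"<!--.*?-->", flags=re.DOTALL)


def strip_comments(md: str) -> str:
    def unwrap(m: re.Match) -> str:
        caption, body = m.group(1).strip(), m.group(2).strip()
        return f"__TABLE_CAPTION__: {caption}\n{body}"

    # Unwrap TABLE comments first, then drop the remaining editorial ones.
    return OTHER_COMMENT.sub("", TABLE_COMMENT.sub(unwrap, md))
```

The key point of the fix is ordering: the table-preserving substitution must run before the wholesale comment strip, otherwise the table body disappears with its wrapper.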
Script changes:
- 21_expanded_validation.py: added 0.9407, 0.977, 0.985 to the canonical FAR threshold list so Table XII-B is reproducible from persisted JSON
- 30_yearly_big4_comparison.py: NEW; generates Fig. 4 plus per-firm yearly data (writes to reports/figures/ and reports/firm_yearly_comparison/)
- 31_within_year_ranking_robustness.py: NEW; supports the within-year robustness check (no longer cited in the paper but kept as a repo-internal due-diligence artifact)

Partner handoff DOCX shipped to ~/Downloads/Paper_A_IEEE_Access_Draft_v3.20.0_20260505.docx (536 KB: 19 tables + 4 figures + 3 equation images).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
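The math-context PUA-sentinel scoping mentioned in the rendering fixes (the mechanism whose leakage the linter's pua-sentinel-leak rule guards against) could look roughly like this. A simplified sketch under assumed names, not the exporter's actual run-splitter:

```python
import re

# Sketch only: wrap $...$ regions in Private Use Area sentinels so that a
# downstream run-splitter applies sub/superscript styling ONLY inside math,
# leaving plain-text identifiers like signature_analysis untouched.
MATH_OPEN, MATH_CLOSE = "\uE000", "\uE001"


def mark_math_regions(text: str) -> str:
    """Replace $...$ delimiters with invisible PUA sentinel characters."""
    return re.sub(r"\$([^$]+)\$",
                  lambda m: MATH_OPEN + m.group(1) + MATH_CLOSE, text)


def split_runs(text: str):
    """Yield (fragment, is_math) pairs; only math fragments get _/^ styling."""
    for i, part in enumerate(re.split(f"[{MATH_OPEN}{MATH_CLOSE}]", text)):
        if part:
            yield part, i % 2 == 1
```

Because the sentinels come from the Unicode Private Use Area, they cannot collide with legitimate document text, which is what makes the linter's "any PUA character in the DOCX is a leak" check sound.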
400 lines
15 KiB
Python
#!/usr/bin/env python3
"""Paper A v3 markdown / DOCX leak linter.

Runs two passes:

Source pass — scans the v3 markdown sources for syntax patterns that the
python-docx export pipeline does NOT render natively. Each finding is a
file:line:severity:message tuple. Severity is ERROR (will leak literal
syntax into Word), WARN (sometimes leaks), or INFO (style nits).

DOCX pass — opens the rendered DOCX and scans every paragraph and table
cell for known leak signatures. This is the authoritative check: even
if the source pass is clean, the DOCX pass tells you what your partner
will actually see. The DOCX pass currently checks for:

- leftover LaTeX commands (`\\cmd`)
- unstripped `$` math delimiters
- pandoc footnote markers (`[^name]`)
- markdown blockquote markers (lines starting with `> `)
- TeX brace tricks (`{=}`, `{,}`)
- PUA sentinels (`\\uE000`, `\\uE001`) leaking from the math-region
  run-splitter
- the synthetic table-caption marker `__TABLE_CAPTION__:` if it ever
  survives processing

Exit code:
  0  clean
  1  WARN-level findings only (ship-able after review)
  2  ERROR-level findings (do NOT ship)

Usage:
  python3 paper/lint_paper_v3.py            # both passes
  python3 paper/lint_paper_v3.py --source   # source-side only
  python3 paper/lint_paper_v3.py --docx     # DOCX-side only

Designed to be run after `python3 export_v3.py` and before copying the
DOCX to ~/Downloads.
"""

from __future__ import annotations

import argparse
import re
import sys
from dataclasses import dataclass
from pathlib import Path

PAPER_DIR = Path(__file__).resolve().parent
DOCX_PATH = PAPER_DIR / "Paper_A_IEEE_Access_Draft_v3.docx"

V3_SOURCES = [
    "paper_a_abstract_v3.md",
    "paper_a_introduction_v3.md",
    "paper_a_related_work_v3.md",
    "paper_a_methodology_v3.md",
    "paper_a_results_v3.md",
    "paper_a_discussion_v3.md",
    "paper_a_conclusion_v3.md",
    "paper_a_appendix_v3.md",
    "paper_a_declarations_v3.md",
    "paper_a_references_v3.md",
]


# ---------------------------------------------------------------------------
# Finding model + ANSI colour helpers
# ---------------------------------------------------------------------------

SEVERITY_RANK = {"ERROR": 2, "WARN": 1, "INFO": 0}
COLOR = {
    "ERROR": "\033[31m",  # red
    "WARN": "\033[33m",   # yellow
    "INFO": "\033[36m",   # cyan
    "RESET": "\033[0m",
    "BOLD": "\033[1m",
}


@dataclass
class Finding:
    severity: str
    rule: str
    location: str  # "file:line" or "DOCX:para 42" / "DOCX:table 6 row 3 col 2"
    message: str
    snippet: str = ""

    def render(self, use_color: bool = True) -> str:
        col = COLOR[self.severity] if use_color else ""
        rst = COLOR["RESET"] if use_color else ""
        bold = COLOR["BOLD"] if use_color else ""
        head = f"{col}[{self.severity}]{rst} {bold}{self.rule}{rst} @ {self.location}"
        body = f"\n  {self.message}"
        snip = f"\n  > {self.snippet}" if self.snippet else ""
        return head + body + snip


# ---------------------------------------------------------------------------
# Source-side rules
# ---------------------------------------------------------------------------

# Each rule: (pattern, severity, rule_id, message, predicate)
# predicate(match, line, in_comment, in_table) → bool: returns True to keep
# the finding (lets us suppress matches that are inside HTML comments or
# markdown table rows).


def _outside_table_comment(match: re.Match, line: str, in_comment: bool, in_table: bool) -> bool:
    """Suppress findings inside HTML comments (where they're allowed) or
    inside markdown table rows (where they survive intact via add_md_table)."""
    return not in_comment and not in_table


def _always(match: re.Match, line: str, in_comment: bool, in_table: bool) -> bool:
    return True


SOURCE_RULES = [
    # Pandoc footnote markers — leak as raw text in the DOCX.
    (re.compile(r"\[\^[A-Za-z0-9_-]+\]"),
     "ERROR", "pandoc-footnote",
     "Pandoc-style footnote `[^name]` does not render in DOCX. "
     "Inline the explanation as a parenthetical instead.",
     _outside_table_comment),

    # Markdown blockquote `> body` lines — the exporter strips them
    # defensively now, but flag for awareness so authors don't rely on
    # them rendering.
    (re.compile(r"^>\s"),
     "WARN", "blockquote",
     "Markdown blockquote `> ...` is stripped to a plain paragraph in DOCX "
     "(no quote-block formatting). If you intended a callout, use a bold "
     "lead-in instead.",
     _always),

    # Display-math fences `$$...$$` (only when the line itself starts with
    # `$$`) — the exporter does a best-effort linearisation, but the result
    # is ugly. Inline the equation as plain prose where possible.
    (re.compile(r"^\$\$.+?\$\$\s*$|^\$\$\s*$"),
     "WARN", "display-math",
     "Display math `$$...$$` renders as a best-effort plain-text "
     "linearisation in DOCX (no MathType/equation rendering). Consider "
     "replacing it with a numbered equation image or inline prose.",
     _always),

    # Inline math containing `\frac{...{...}...}` — nested braces in a
    # frac argument are not handled by the exporter's regex.
    (re.compile(r"\\t?frac\{[^{}]*\{[^{}]*\}[^{}]*\}\{|\\t?frac\{[^{}]+\}\{[^{}]*\{"),
     "WARN", "nested-frac",
     "Nested-brace `\\frac{...}{...}` may not linearise cleanly. Verify "
     "the rendered DOCX paragraph or rewrite the math inline.",
     _outside_table_comment),

    # Setext-style headers (=== / ---) under a line of text — not handled.
    (re.compile(r"^=+\s*$|^-{3,}\s*$"),
     "INFO", "setext-header",
     "Setext-style header (=== / ---) is not handled by the exporter; "
     "use ATX (#, ##, ###) instead.",
     _always),

    # Pandoc fenced div `:::` — not handled.
    (re.compile(r"^:::"),
     "ERROR", "pandoc-fenced-div",
     "Pandoc fenced div `:::` is not handled by the exporter and would "
     "leak into the DOCX as plain text.",
     _always),

    # Pandoc bracketed-attribute spans `[text]{.class}` — not handled.
    (re.compile(r"\]\{[^}]*\}"),
     "WARN", "pandoc-attribute-span",
     "Pandoc attribute span `[text]{.class}` is not parsed by the exporter "
     "and the brace block will leak.",
     _outside_table_comment),

    # File paths in body text — Appendix B is the canonical home for
    # script→artifact references.
    (re.compile(r"`signature_analysis/\d+_[a-z_]+\.py`"),
     "INFO", "script-path-in-body",
     "Verbose script path in body text. Consider replacing it with "
     "'(reproduction artifact in Appendix B)' for body-prose tightness.",
     _outside_table_comment),

    # `reports/...json` paths in body text — same rationale.
    (re.compile(r"`reports/[a-z_]+/[a-z_]+\.(?:json|md)`"),
     "INFO", "report-path-in-body",
     "Verbose report-artifact path in body text. Consider replacing it with "
     "'(see Appendix B provenance map)'.",
     _outside_table_comment),

    # Bare HTML comments that are NOT TABLE/FIGURE markers may indicate
    # editorial residue. Stripped wholesale by the exporter, so harmless,
    # but worth visibility.
    (re.compile(r"^<!--\s*$|^<!-- (?!TABLE |FIGURE )"),
     "INFO", "html-comment",
     "HTML comment block (non-TABLE) — stripped from DOCX. Keep for "
     "editorial notes or remove for tidiness.",
     _always),
]


def lint_sources() -> list[Finding]:
    findings: list[Finding] = []
    for src in V3_SOURCES:
        path = PAPER_DIR / src
        if not path.exists():
            continue
        in_comment = False
        in_table = False
        for line_no, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
            # Track HTML-comment context (multi-line aware).
            if "<!--" in line:
                in_comment = True
            stripped = line.strip()
            in_table = stripped.startswith("|") and stripped.endswith("|")
            for pat, sev, rule, msg, predicate in SOURCE_RULES:
                for m in pat.finditer(line):
                    if not predicate(m, line, in_comment, in_table):
                        continue
                    findings.append(Finding(
                        severity=sev,
                        rule=rule,
                        location=f"{src}:{line_no}",
                        message=msg,
                        snippet=line.rstrip()[:120],
                    ))
            if "-->" in line:
                in_comment = False
    return findings


# ---------------------------------------------------------------------------
# DOCX-side rules
# ---------------------------------------------------------------------------

DOCX_LEAK_PATTERNS = [
    # (pattern, severity, rule_id, message)
    (re.compile(r"\\[a-zA-Z]+(?:\{[^{}]*\})?"),
     "ERROR", "leftover-latex-cmd",
     "LaTeX command `\\cmd` leaked into DOCX. Either add a token rule to "
     "`latex_to_unicode` in `export_v3.py` or rewrite the source as plain text."),

    (re.compile(r"(?<!\\)\$[^$\s][^$]*\$"),
     "ERROR", "unstripped-dollar-math",
     "Inline math `$...$` was not stripped. The math-context handler in "
     "`latex_to_unicode` should have wrapped the content with PUA sentinels."),

    (re.compile(r"\[\^[A-Za-z0-9_-]+\]"),
     "ERROR", "pandoc-footnote-leak",
     "Pandoc footnote marker leaked into DOCX. Inline the footnote body "
     "as a parenthetical at the source."),

    (re.compile(r"^>\s"),
     "ERROR", "blockquote-leak",
     "Markdown blockquote `> ...` leaked a literal `>` into DOCX. The "
     "exporter pre-pass should strip these — check `process_section`."),

    (re.compile(r"\{[,=<>+\-]\}"),
     "ERROR", "tex-brace-trick",
     "TeX brace-trick `{=}` / `{,}` leaked. Should be stripped by "
     "`latex_to_unicode`."),

    (re.compile("[\uE000\uE001]"),
     "ERROR", "pua-sentinel-leak",
     "Math-region PUA sentinel (\\uE000 / \\uE001) leaked. A render path "
     "is bypassing `add_text_with_subsup`; check headings / list items / "
     "title-page paragraphs."),

    (re.compile(r"__TABLE_CAPTION__"),
     "ERROR", "table-caption-marker-leak",
     "Synthetic `__TABLE_CAPTION__:` marker leaked. The marker is meant "
     "to be consumed by `process_section` and rendered as a centered "
     "bold caption paragraph."),

    (re.compile(r"signatureanalysis/\d+[a-z]+\.py"),
     "ERROR", "underscore-eaten-path",
     "Underscores eaten from a script path (e.g., "
     "`signatureanalysis/28byteidentitydecomposition.py`). The "
     "math-context-scoped subscript handler in `add_text_with_subsup` "
     "should leave underscores intact in plain text."),

    (re.compile(r"\b(\w+_\w+)+\b", flags=re.UNICODE),
     "INFO", "underscore-identifier",
     "Underscored identifier in body text (e.g., a code symbol or path). "
     "Verify it renders with underscores intact, not as subscripts."),
]


def lint_docx(docx_path: Path = DOCX_PATH) -> list[Finding]:
    try:
        from docx import Document
    except ImportError:
        return [Finding("ERROR", "missing-dep",
                        "lint:docx",
                        "python-docx is not installed; cannot run DOCX pass.")]

    if not docx_path.exists():
        return [Finding("ERROR", "missing-docx",
                        str(docx_path),
                        "Built DOCX not found. Run `python3 export_v3.py` first.")]

    doc = Document(str(docx_path))
    findings: list[Finding] = []
    seen_signatures = set()  # dedupe identical leaks across paragraphs

    def scan(text: str, location: str):
        for pat, sev, rule, msg in DOCX_LEAK_PATTERNS:
            for m in pat.finditer(text):
                # Skip the INFO-level identifier rule unless it looks like
                # obvious math residue (e.g., dHash_indep or N_a).
                if rule == "underscore-identifier":
                    sample = m.group(0)
                    # Only complain about identifiers that look like math
                    # residue: short, underscore-separated alphanumeric tokens.
                    parts = sample.split("_")
                    if not all(len(p) <= 4 for p in parts):
                        continue
                    if not all(p.isalnum() and not p.isdigit() for p in parts):
                        continue
                key = (rule, m.group(0))
                if key in seen_signatures:
                    continue
                seen_signatures.add(key)
                findings.append(Finding(
                    severity=sev,
                    rule=rule,
                    location=location,
                    message=msg,
                    snippet=text[max(0, m.start() - 30):m.end() + 30].replace("\n", " ")[:140],
                ))

    for i, p in enumerate(doc.paragraphs):
        if p.text:
            scan(p.text, f"DOCX:para {i}")
    for ti, t in enumerate(doc.tables):
        for ri, row in enumerate(t.rows):
            for ci, cell in enumerate(row.cells):
                if cell.text:
                    scan(cell.text, f"DOCX:table {ti + 1} row {ri} col {ci}")

    return findings


# ---------------------------------------------------------------------------
# Reporter
# ---------------------------------------------------------------------------

def summarise(findings: list[Finding], use_color: bool = True) -> int:
    def c(key: str) -> str:
        return COLOR[key] if use_color else ""

    if not findings:
        print(f"{c('BOLD')}{c('INFO')}clean — no leaks detected{c('RESET')}")
        return 0

    counts = {"ERROR": 0, "WARN": 0, "INFO": 0}
    findings.sort(key=lambda f: (-SEVERITY_RANK[f.severity], f.location))
    for f in findings:
        counts[f.severity] += 1
        print(f.render(use_color))
    print()
    print(f"{c('BOLD')}summary{c('RESET')}: "
          f"{c('ERROR')}{counts['ERROR']} ERROR{c('RESET')} "
          f"{c('WARN')}{counts['WARN']} WARN{c('RESET')} "
          f"{c('INFO')}{counts['INFO']} INFO{c('RESET')}")
    if counts["ERROR"]:
        return 2
    if counts["WARN"]:
        return 1
    return 0
def main():
    ap = argparse.ArgumentParser(
        description="Lint Paper A v3 markdown sources and rendered DOCX for "
                    "syntax-leak issues.",
    )
    ap.add_argument("--source", action="store_true",
                    help="run only the markdown source pass")
    ap.add_argument("--docx", action="store_true",
                    help="run only the rendered DOCX pass")
    ap.add_argument("--no-color", action="store_true",
                    help="disable ANSI colour output")
    args = ap.parse_args()

    use_color = sys.stdout.isatty() and not args.no_color
    findings: list[Finding] = []
    if args.source or not (args.source or args.docx):
        print(f"{COLOR['BOLD'] if use_color else ''}--- source pass "
              f"({len(V3_SOURCES)} files) ---{COLOR['RESET'] if use_color else ''}")
        findings.extend(lint_sources())
    if args.docx or not (args.source or args.docx):
        print(f"{COLOR['BOLD'] if use_color else ''}\n--- docx pass "
              f"({DOCX_PATH.name}) ---{COLOR['RESET'] if use_color else ''}")
        findings.extend(lint_docx())

    print()
    sys.exit(summarise(findings, use_color))


if __name__ == "__main__":
    main()