552b6b80d4
Implements codex gpt-5.4 recommendation (paper/codex_bd_mccrary_opinion.md,
"option (c) hybrid"): demote BD/McCrary in the main text from a co-equal
threshold estimator to a density-smoothness diagnostic, and add a
bin-width sensitivity appendix as an audit trail.
Why: the bin-width sweep (Script 25) confirms that at the signature
level the BD transition drifts monotonically with bin width (Firm A
cosine: 0.987 -> 0.985 -> 0.980 -> 0.975 as bin width widens 0.003 ->
0.015; full-sample dHash transitions drift from 2 to 10 to 9 across
bin widths 1 / 2 / 3) and Z statistics inflate superlinearly with bin
width, both characteristic of a histogram-resolution artifact. At the
accountant level the BD null is robust across the sweep. The paper's
earlier "three methodologically distinct estimators" framing therefore
could not be defended to an IEEE Access reviewer once the sweep was
run.
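Why the drift implicates the histogram rather than the data: on a perfectly smooth density, the jump statistic between the two bins adjacent to the threshold grows roughly like bin_width^(3/2), so Z inflates superlinearly with bin width even when no discontinuity exists. Script 25 is not reproduced here; a minimal sketch of that effect, with synthetic smooth scores and a simplified two-bin jump statistic (not the script's actual estimator), might look like:

```python
import numpy as np

def discontinuity_z(values, threshold, bin_width):
    """Z statistic for the count jump between the two histogram bins
    adjacent to `threshold`, under a Poisson-count approximation."""
    below = int(np.sum((values >= threshold - bin_width) & (values < threshold)))
    above = int(np.sum((values >= threshold) & (values < threshold + bin_width)))
    denom = (below + above) ** 0.5
    return (above - below) / denom if denom > 0 else 0.0

# Smooth, discontinuity-free similarity scores (hypothetical stand-in):
# any "significant" Z here is a histogram-resolution artifact.
rng = np.random.default_rng(0)
scores = rng.beta(8, 2, size=2_000_000)

z_by_width = {}
for bw in (0.003, 0.006, 0.010, 0.015):  # the sweep's bin-width grid
    z_by_width[bw] = discontinuity_z(scores, threshold=0.837, bin_width=bw)
    print(f"bin_width={bw:.3f}  Z={z_by_width[bw]:+.2f}")
```

With no true break in the density, |Z| still climbs as the bins widen, which is the artifact signature the sweep is designed to expose.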
Added
- signature_analysis/25_bd_mccrary_sensitivity.py: bin-width sweep
across 6 variants (Firm A / full-sample / accountant-level, each
cosine + dHash_indep) and 3-4 bin widths per variant. Reports
Z_below, Z_above, p-values, and number of significant transitions
per cell. Writes reports/bd_sensitivity/bd_sensitivity.{json,md}.
- paper/paper_a_appendix_v3.md: new "Appendix A. BD/McCrary Bin-Width
Sensitivity" with Table A.I (all 20 sensitivity cells) and
interpretation linking the empirical pattern to the main-text
framing decision.
- export_v3.py: appendix inserted into SECTIONS between conclusion
and references.
- paper/codex_bd_mccrary_opinion.md: codex gpt-5.4 recommendation
captured verbatim for audit trail.
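The dual {json,md} write in Script 25 keeps one machine-readable copy for downstream scripts and one human-readable table for the audit trail. A stdlib-only sketch of that pattern (the cell fields shown are hypothetical, not the script's actual values):

```python
import json
from pathlib import Path

# Hypothetical sensitivity cells; the real script computes one per
# variant x bin-width combination.
cells = [
    {"variant": "firm_a_cosine", "bin_width": 0.003, "z_below": 1.1,
     "z_above": -0.4, "p_value": 0.27, "n_transitions": 2},
    {"variant": "firm_a_cosine", "bin_width": 0.015, "z_below": 3.9,
     "z_above": -2.8, "p_value": 0.001, "n_transitions": 9},
]

out_dir = Path("reports/bd_sensitivity")
out_dir.mkdir(parents=True, exist_ok=True)

# Machine-readable copy for downstream tooling.
(out_dir / "bd_sensitivity.json").write_text(json.dumps(cells, indent=2))

# Human-readable Markdown table for the audit trail.
header = "| variant | bin_width | Z_below | Z_above | p | transitions |"
sep = "|---|---|---|---|---|---|"
rows = [
    f"| {c['variant']} | {c['bin_width']} | {c['z_below']} "
    f"| {c['z_above']} | {c['p_value']} | {c['n_transitions']} |"
    for c in cells
]
(out_dir / "bd_sensitivity.md").write_text("\n".join([header, sep] + rows))
```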
Main-text reframing
- Abstract: "three methodologically distinct estimators" ->
"two estimators plus a Burgstahler-Dichev/McCrary density-
smoothness diagnostic". Trimmed to 243 words.
- Introduction: related-work summary, pipeline step 5, accountant-
level convergence sentence, contribution 4, and section-outline
line all updated. Contribution 4 renamed to "Convergent threshold
framework with a smoothness diagnostic".
- Methodology III-I: section renamed to "Convergent Threshold
Determination with a Density-Smoothness Diagnostic". "Method 2:
BD/McCrary Discontinuity" converted to "Density-Smoothness
Diagnostic" in a new subsection; Method 3 (Beta mixture) renumbered
to Method 2. Subsections 4 and 5 updated to refer to "two threshold
estimators" with BD as diagnostic.
- Methodology III-A pipeline overview: "three methodologically
distinct statistical methods" -> "two methodologically distinct
threshold estimators complemented by a density-smoothness
diagnostic".
- Methodology III-L: "three-method analysis" -> "accountant-level
threshold analysis (KDE antimode, Beta-2 crossing, logit-Gaussian
robustness crossing)".
- Results IV-D.1 heading: "BD/McCrary Discontinuity" ->
"BD/McCrary Density-Smoothness Diagnostic". Prose now notes the
Appendix-A bin-width instability explicitly.
- Results IV-E: Table VIII restructured to label BD rows
"(diagnostic only; bin-unstable)" and "(diagnostic; null across
Appendix A)". Summary sentence rewritten to frame BD null as
evidence for clustered-but-smoothly-mixed rather than as a
convergence failure. Table cosine P5 row corrected from 0.941 to
0.9407 to match III-K.
- Results IV-G.3 and IV-I.2: "three-method convergence/thresholds"
-> "accountant-level convergent thresholds" (clarifies the 3
converging estimates are KDE antimode, Beta-2, logit-Gaussian,
not KDE/BD/Beta).
- Discussion V-B: "three-method framework" -> "convergent threshold
framework".
- Conclusion: "three methodologically distinct methods" -> "two
threshold estimators and a density-smoothness diagnostic";
contribution 3 restated; future-work sentence updated.
- Impact Statement (archived): "three methodologically distinct
threshold-selection methods" -> "two methodologically distinct
threshold estimators plus a density-smoothness diagnostic" so the
archived text is internally consistent if reused.
Discussion V-B / V-G already framed BD as a diagnostic in v3.5
(unchanged in this commit). The reframing therefore brings Abstract /
Introduction / Methodology / Results / Conclusion into alignment with
the Discussion framing that codex had already endorsed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
247 lines
8.7 KiB
Python
#!/usr/bin/env python3
"""Export Paper A v3 (IEEE Access target) to Word, reading from v3 md section files."""

from docx import Document
from docx.shared import Inches, Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
from pathlib import Path
import re

PAPER_DIR = Path("/Volumes/NV2/pdf_recognize/paper")
FIG_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis/paper_figures")
EXTRA_FIG_DIR = Path("/Volumes/NV2/PDF-Processing/signature-analysis/reports")
OUTPUT = PAPER_DIR / "Paper_A_IEEE_Access_Draft_v3.docx"

SECTIONS = [
    "paper_a_abstract_v3.md",
    # paper_a_impact_statement_v3.md removed: not a standard IEEE Access
    # Regular Paper section. Content folded into cover letter / abstract.
    "paper_a_introduction_v3.md",
    "paper_a_related_work_v3.md",
    "paper_a_methodology_v3.md",
    "paper_a_results_v3.md",
    "paper_a_discussion_v3.md",
    "paper_a_conclusion_v3.md",
    # Appendix A: BD/McCrary bin-width sensitivity (see v3.7 notes).
    "paper_a_appendix_v3.md",
    "paper_a_references_v3.md",
]

# Figure insertion hooks (trigger phrase -> (file, caption, width inches)).
# New figures for v3: dip test, BD/McCrary overlays, accountant GMM 2D + marginals.
FIGURES = {
    "Fig. 1 illustrates": (
        FIG_DIR / "fig1_pipeline.png",
        "Fig. 1. Pipeline architecture for automated non-hand-signed signature detection.",
        6.5,
    ),
    "Fig. 2 presents the cosine similarity distributions for intra-class": (
        FIG_DIR / "fig2_intra_inter_kde.png",
        "Fig. 2. Cosine similarity distributions: intra-class vs. inter-class with KDE crossover at 0.837.",
        3.5,
    ),
    "Fig. 3 presents the per-signature cosine and dHash distributions of Firm A": (
        FIG_DIR / "fig3_firm_a_calibration.png",
        "Fig. 3. Firm A per-signature cosine and dHash distributions against the overall CPA population.",
        3.5,
    ),
    "Fig. 4 visualizes the accountant-level clusters": (
        EXTRA_FIG_DIR / "accountant_mixture" / "accountant_mixture_2d.png",
        "Fig. 4. Accountant-level 3-component Gaussian mixture in the (cosine-mean, dHash-mean) plane.",
        4.5,
    ),
    "conducted an ablation study comparing three": (
        FIG_DIR / "fig4_ablation.png",
        "Fig. 5. Ablation study comparing three feature extraction backbones.",
        6.5,
    ),
}


def strip_comments(text):
    return re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)


def add_md_table(doc, table_lines):
    rows_data = []
    for line in table_lines:
        cells = [c.strip() for c in line.strip("|").split("|")]
        if not re.match(r"^[-: ]+$", cells[0]):
            rows_data.append(cells)
    if len(rows_data) < 2:
        return
    ncols = len(rows_data[0])
    table = doc.add_table(rows=len(rows_data), cols=ncols)
    table.style = "Table Grid"
    for r_idx, row in enumerate(rows_data):
        for c_idx in range(min(len(row), ncols)):
            cell = table.rows[r_idx].cells[c_idx]
            cell.text = row[c_idx]
            for p in cell.paragraphs:
                p.alignment = WD_ALIGN_PARAGRAPH.CENTER
                for run in p.runs:
                    run.font.size = Pt(8)
                    run.font.name = "Times New Roman"
                    if r_idx == 0:
                        run.bold = True
    doc.add_paragraph()


def _insert_figures(doc, para_text):
    for trigger, (fig_path, caption, width) in FIGURES.items():
        if trigger in para_text and Path(fig_path).exists():
            fp = doc.add_paragraph()
            fp.alignment = WD_ALIGN_PARAGRAPH.CENTER
            fr = fp.add_run()
            fr.add_picture(str(fig_path), width=Inches(width))
            cp = doc.add_paragraph()
            cp.alignment = WD_ALIGN_PARAGRAPH.CENTER
            cr = cp.add_run(caption)
            cr.font.size = Pt(9)
            cr.font.name = "Times New Roman"
            cr.italic = True


def process_section(doc, filepath):
    text = filepath.read_text(encoding="utf-8")
    text = strip_comments(text)
    lines = text.split("\n")
    i = 0
    while i < len(lines):
        line = lines[i]
        stripped = line.strip()
        if not stripped:
            i += 1
            continue
        if stripped.startswith("# "):
            h = doc.add_heading(stripped[2:], level=1)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        if stripped.startswith("## "):
            h = doc.add_heading(stripped[3:], level=2)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        if stripped.startswith("### "):
            h = doc.add_heading(stripped[4:], level=3)
            for run in h.runs:
                run.font.color.rgb = RGBColor(0, 0, 0)
            i += 1
            continue
        if "|" in stripped and i + 1 < len(lines) and re.match(r"\s*\|[-|: ]+\|", lines[i + 1]):
            table_lines = []
            while i < len(lines) and "|" in lines[i]:
                table_lines.append(lines[i])
                i += 1
            add_md_table(doc, table_lines)
            continue
        if re.match(r"^\d+\.\s", stripped):
            p = doc.add_paragraph(style="List Number")
            content = re.sub(r"^\d+\.\s", "", stripped)
            content = re.sub(r"\*\*(.+?)\*\*", r"\1", content)
            run = p.add_run(content)
            run.font.size = Pt(10)
            run.font.name = "Times New Roman"
            i += 1
            continue
        if stripped.startswith("- "):
            p = doc.add_paragraph(style="List Bullet")
            content = stripped[2:]
            content = re.sub(r"\*\*(.+?)\*\*", r"\1", content)
            run = p.add_run(content)
            run.font.size = Pt(10)
            run.font.name = "Times New Roman"
            i += 1
            continue
        # Regular paragraph
        para_lines = [stripped]
        i += 1
        while i < len(lines):
            nxt = lines[i].strip()
            if (
                not nxt
                or nxt.startswith("#")
                or nxt.startswith("|")
                or nxt.startswith("- ")
                or re.match(r"^\d+\.\s", nxt)
            ):
                break
            para_lines.append(nxt)
            i += 1
        para_text = " ".join(para_lines)
        para_text = re.sub(r"\*\*\*(.+?)\*\*\*", r"\1", para_text)
        para_text = re.sub(r"\*\*(.+?)\*\*", r"\1", para_text)
        para_text = re.sub(r"\*(.+?)\*", r"\1", para_text)
        para_text = re.sub(r"`(.+?)`", r"\1", para_text)
        para_text = para_text.replace("$$", "")
        para_text = para_text.replace("---", "\u2014")

        p = doc.add_paragraph()
        p.paragraph_format.space_after = Pt(6)
        run = p.add_run(para_text)
        run.font.size = Pt(10)
        run.font.name = "Times New Roman"

        _insert_figures(doc, para_text)


def main():
    doc = Document()
    style = doc.styles["Normal"]
    style.font.name = "Times New Roman"
    style.font.size = Pt(10)

    # Title page
    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    p.paragraph_format.space_after = Pt(12)
    run = p.add_run(
        "Automated Identification of Non-Hand-Signed Auditor Signatures\n"
        "in Large-Scale Financial Audit Reports:\n"
        "A Dual-Descriptor Framework with Three-Method Convergent Thresholding"
    )
    run.font.size = Pt(16)
    run.font.name = "Times New Roman"
    run.bold = True

    # IEEE Access uses single-anonymized review: author / affiliation
    # / corresponding-author block must appear on the title page in the
    # final submission. Fill these placeholders with real metadata
    # before submitting the generated DOCX.
    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    p.paragraph_format.space_after = Pt(6)
    run = p.add_run("[AUTHOR NAMES — fill in before submission]")
    run.font.size = Pt(11)

    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    p.paragraph_format.space_after = Pt(6)
    run = p.add_run("[Affiliations and corresponding-author email — fill in before submission]")
    run.font.size = Pt(10)
    run.italic = True

    p = doc.add_paragraph()
    p.alignment = WD_ALIGN_PARAGRAPH.CENTER
    p.paragraph_format.space_after = Pt(20)
    run = p.add_run("Target journal: IEEE Access (Regular Paper, single-anonymized review)")
    run.font.size = Pt(10)
    run.italic = True

    for section_file in SECTIONS:
        filepath = PAPER_DIR / section_file
        if filepath.exists():
            process_section(doc, filepath)
        else:
            print(f"WARNING: missing section file: {filepath}")

    doc.save(str(OUTPUT))
    print(f"Saved: {OUTPUT}")


if __name__ == "__main__":
    main()
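The paragraph path in process_section collapses inline Markdown with a chain of ordered regexes, longest delimiter first (***...*** before **...** before *...*). That chain, lifted out of the script and run in isolation:

```python
import re

def strip_inline_md(text):
    # Same substitution order as process_section: unwrap the longest
    # asterisk delimiters first, then backticks, math, and dashes.
    text = re.sub(r"\*\*\*(.+?)\*\*\*", r"\1", text)
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)
    text = re.sub(r"\*(.+?)\*", r"\1", text)
    text = re.sub(r"`(.+?)`", r"\1", text)
    text = text.replace("$$", "")
    text = text.replace("---", "\u2014")  # triple hyphen -> em dash
    return text

print(strip_inline_md("**Method 2:** the *Beta* mixture, `dHash_indep`"))
```

Note this is a plain-text flattening, not a Markdown parser: nested or unbalanced delimiters fall through unchanged, which is acceptable here because the v3 section files are authored to the subset the exporter handles.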