gbanyan 615059a2c1 Paper A v3.10: resolve Opus 4.7 round-9 paper-vs-Appendix-A contradiction
Opus round-9 review (paper/opus_final_review_v3_9.md) dissented from
Gemini round-7 Accept and aligned with codex round-8 Minor, but for a
DIFFERENT issue all prior reviewers missed: the paper's main text in
four locations flatly claimed the BD/McCrary accountant-level null
"persists across the Appendix-A bin-width sweep", yet Appendix A
Table A.I itself documents a significant accountant-level cosine
transition at bin 0.005 with |Z_below|=3.23, |Z_above|=5.18 (both
past 1.96) located at cosine 0.980 --- on the upper edge of our two
threshold estimators' convergence band [0.973, 0.979]. This is a
paper-to-appendix contradiction that a careful reviewer would catch
in 30 seconds.

BLOCKER B1: BD/McCrary accountant-level claim softened across all
four locations to match what Appendix A Table A.I actually reports:
- Results IV-D.1 (lines 85-86): rewritten to say the null is not
  rejected at 2/3 cosine bin widths and 2/3 dHash bin widths, with
  the one cosine transition at bin 0.005 sitting on the upper edge
  of the convergence band and the one dHash transition at |Z|=1.96.
- Results IV-E Table VIII row (line 145): "no transition / no
  transition" changed to "0.980 at bin 0.005 only; null at 0.002,
  0.010" / "3.0 at bin 1.0 only (|Z|=1.96); null at 0.2, 0.5".
- Results IV-E line 130 (Third finding): "does not produce a
  significant transition (robust across bin-width sweep)" replaced
  with "largely null at the accountant level --- no significant
  transition at 2/3 cosine bin widths and 2/3 dHash bin widths,
  with the one cosine transition at bin 0.005 sitting at cosine
  0.980 on the upper edge of the convergence band".
- Results IV-E line 152 (Table VIII synthesis paragraph): matched
  reframing.
- Discussion V-B (line 27): "does not produce a significant
  transition at the accountant level either" -> "largely null at
  the accountant level ... with the one cosine transition on the
  upper edge of the convergence band".
- Conclusion (line 16): matched reframing with power caveat
  retained.

MAJOR M1: Related Work L67 stale "well suited to detecting the
boundary between two generative mechanisms" framing (residue from
pre-demotion drafts) replaced with a local-density-discontinuity
diagnostic framing that matches the rest of the paper and flags
the signature-level bin-width sensitivity + accountant-level rarity
as documented in Appendix A.

MAJOR M2: Table XII orphaned in-text anchor --- Table XII is defined
inside IV-G.3 but had no in-text "Table XII reports ..." pointer at
its presentation location. Added a single sentence before the table
comment.

MINOR m1: Section IV-I.1 "4 of 30,000+ Firm A documents, 0.01%"
replaced with the exact "4 of 30,226 Firm A documents, 0.013%".

MINOR m2: Section IV-E "the two-dimensional two-component GMM"
wording ambiguity (reader might confuse with the already-selected
K*=3 GMM from BIC) replaced with explicit "a separately fit
two-component 2D GMM (reported as a cross-check on the 1D
accountant-level crossings)".

MINOR m3: Section IV-D L59 "downstream all-pairs analyses
(Tables XII, XVIII)" misnomer --- Table XII is per-signature
classifier output not all-pairs; Table XVIII's all-pairs are over
~16M pairs not 168,740. Replaced with an accurate list:
"same-CPA per-signature best-match analyses (Tables V and XII, and
the Firm-A per-signature rows of Tables XIII and XVIII)".

MINOR m4: Methodology III-H L156 "the validation role is played by
... the held-out Firm A fold" slightly overclaims what the held-out
fold establishes (the fold-level rates differ by 1-5 pp with
p<0.001). Parenthetical hedge added: "(which confirms the qualitative
replication-dominated framing; fold-level rate differences are
disclosed in Section IV-G.2)".

Also add:
- paper/opus_final_review_v3_9.md (Opus 4.7 max-effort review)
- paper/gemini_review_v3_8.md (Gemini round-7 Accept verdict, was
  missing from prior commit)

Abstract remains 243 words (under IEEE Access 250 limit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 15:25:04 +08:00

PDF Signature Extraction System

Automated extraction of handwritten Chinese signatures from PDF documents using a hybrid VLM + computer vision approach.

Quick Start

Step 1: Extract Pages from CSV

cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py

Step 2: Extract Signatures

python extract_signatures_hybrid.py

Documentation

Current Performance

Test Dataset: 5 PDF pages

  • Signatures expected: 10
  • Signatures found: 7
  • Precision: 100% (no false positives)
  • Recall: 70%
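
The figures above follow directly from the counts. As a minimal sketch of the arithmetic (the function name and the true/false positive breakdown are illustrative assumptions, not taken from the project code): 7 signatures found with no false positives gives 7 true positives, and 10 expected minus 7 found gives 3 missed signatures.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from raw detection counts."""
    detected = true_positives + false_positives
    expected = true_positives + false_negatives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / expected if expected else 0.0
    return precision, recall


# Counts from the 5-page test: 10 expected, 7 found, no false positives.
precision, recall = precision_recall(
    true_positives=7, false_positives=0, false_negatives=3
)
print(f"Precision: {precision:.0%}, Recall: {recall:.0%}")
```

With only 5 pages and 10 expected signatures, each missed signature moves recall by 10 percentage points, so these numbers should be read as preliminary.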

Key Features

  • Hybrid Approach: VLM name extraction + CV detection + VLM verification
  • Name-Based: Signatures saved as signature_周寶蓮.png
  • No False Positives: Name-specific verification filters out dates, text, stamps
  • Duplicate Prevention: Only one signature per person
  • Handles Both: PDFs with/without text layer
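
The name-based naming and duplicate-prevention features can be sketched roughly as follows. This is an illustration only: `signature_path` and `select_unique_signatures` are hypothetical helpers, not functions from extract_signatures_hybrid.py, and keeping the first verified candidate per name is an assumption about how duplicates are resolved.

```python
from pathlib import Path


def signature_path(output_dir, name):
    """Build the output path for a person's signature, e.g. signature_周寶蓮.png."""
    return Path(output_dir) / f"signature_{name}.png"


def select_unique_signatures(candidates, output_dir):
    """Keep one candidate per person name.

    `candidates` is an iterable of (name, image_bytes) pairs that have
    already passed verification. Assumption: the first candidate seen
    for a name is the one kept.
    """
    seen = set()
    selected = []
    for name, image in candidates:
        if name in seen:
            continue  # duplicate prevention: one signature per person
        seen.add(name)
        selected.append((signature_path(output_dir, name), image))
    return selected


# Hypothetical candidates: two detections for the same person, one for another.
candidates = [("周寶蓮", b"first"), ("周寶蓮", b"second"), ("example_name", b"third")]
for path, _ in select_unique_signatures(candidates, "signature-image-output"):
    print(path.name)
```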

File Structure

extract_pages_from_csv.py          # Step 1: Extract pages
extract_signatures_hybrid.py       # Step 2: Extract signatures (CURRENT)
README.md                          # This file
PROJECT_DOCUMENTATION.md           # Complete documentation
README_page_extraction.md          # Page extraction guide
README_hybrid_extraction.md        # Signature extraction guide

Requirements

Data

  • Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
  • PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
  • Output: /Volumes/NV2/PDF-Processing/signature-image-output/
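
As a rough sketch of how the batch_* layout above could be enumerated, assuming the PDFs sit directly inside each batch directory and use a lowercase .pdf extension (neither detail is confirmed here; `list_batch_pdfs` is a hypothetical helper):

```python
from pathlib import Path


def list_batch_pdfs(root):
    """Return a sorted list of PDF paths found directly under root/batch_*/."""
    return sorted(Path(root).glob("batch_*/*.pdf"))


if __name__ == "__main__":
    pdfs = list_batch_pdfs("/Volumes/NV2/PDF-Processing/total-pdf")
    print(f"Found {len(pdfs)} PDF files")
```

Comparing the resulting count against the row count of master_signatures.csv is a cheap sanity check before a large run.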

Status

  • Page extraction: Tested with 100 files, working
  • Signature extraction: Tested with 5 files, 70% recall, 100% precision
  • Large-scale testing: Pending
  • Full dataset (86K files): Pending

See PROJECT_DOCUMENTATION.md for complete details.
