gbanyan 623eb4cd4b Paper A v3.19.1: address codex partner-redpen audit residual ("upper bound" wording)
Codex GPT-5.5 cross-verified the Gemini partner red-pen audit
(paper/codex_partner_redpen_audit_v3_19_0.md) and downgraded item (j) --
the BIC strict-3-component upper-bound framing -- from RESOLVED to
IMPROVED, because the "upper bound" wording the partner originally
red-circled in v3.17 still survived in two methodology sentences and one
Table VI row label, even though Section IV-D.3 had been retitled
"A Forced Fit" in v3.18.

This commit closes that residual:

- Methodology III-I.2: "the 2-component crossing should be treated as
  an upper bound rather than a definitive cut" -> "we report the
  resulting crossing only as a forced-fit descriptive reference and do
  not use it as an operational threshold".
- Methodology III-I.4: "should be read as an upper bound rather than a
  definitive cut" -> "reported only as a descriptive reference rather
  than as an operational threshold".
- Table VI row "0.973 (signature-level Beta/KDE upper bound)" relabelled
  to "0.973 (signature-level Beta/KDE forced-fit reference)" to match
  the IV-D.3 "Forced Fit" framing.
- reference_verification_v3.md header updated so the [5] entry reads as
  an audit trail of a fix already applied (v3.18 reference list reflects
  every correction) rather than as an active major problem.
- Rebuild Paper_A_IEEE_Access_Draft_v3.docx.

Also commits the codex partner-redpen audit artifact so the disagreement
trail with Gemini is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 23:05:39 +08:00

PDF Signature Extraction System

Automated extraction of handwritten Chinese signatures from PDF documents using a hybrid VLM + computer vision approach.

Quick Start

Step 1: Extract Pages from CSV

cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py

Step 2: Extract Signatures

python extract_signatures_hybrid.py

Documentation

Current Performance

Test Dataset: 5 PDF pages

  • Signatures expected: 10
  • Signatures found: 7
  • Precision: 100% (no false positives)
  • Recall: 70%
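The two figures above follow directly from the counts. A minimal sketch of the arithmetic, assuming all 7 found signatures are correct (which the "no false positives" note implies); the function name `precision_recall` is illustrative and not part of the repository:

```python
def precision_recall(true_positives: int, false_positives: int, expected: int) -> tuple[float, float]:
    """Return (precision, recall) as fractions in [0, 1]."""
    found = true_positives + false_positives
    # Precision: share of found signatures that are correct.
    precision = true_positives / found if found else 0.0
    # Recall: share of expected signatures that were found.
    recall = true_positives / expected if expected else 0.0
    return precision, recall


# Counts from the 5-page test set: 10 expected, 7 found, 0 false positives.
precision, recall = precision_recall(true_positives=7, false_positives=0, expected=10)
print(f"Precision: {precision:.0%}, Recall: {recall:.0%}")
```

With only 5 pages and 10 expected signatures, each missed signature moves recall by 10 percentage points, so these numbers are best read as a smoke test rather than a stable estimate.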

Key Features

  • Hybrid Approach: VLM name extraction + CV detection + VLM verification
  • Name-Based: Signatures saved as signature_周寶蓮.png
  • No False Positives: Name-specific verification filters out dates, text, stamps
  • Duplicate Prevention: Only one signature per person
  • Handles Both: PDFs with/without text layer
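The name-based naming and duplicate-prevention features can be illustrated with a short sketch. This is not the code in extract_signatures_hybrid.py; the helpers `signature_filename` and `plan_outputs`, the `out` directory, and the placeholder name `PERSON_B` are all hypothetical. It assumes the first verified crop for a name is the one kept.

```python
from pathlib import Path


def signature_filename(name: str) -> str:
    """Build the output file name for one person's signature (hypothetical helper)."""
    return f"signature_{name}.png"


def plan_outputs(verified_names: list[str], output_dir: str) -> dict[str, Path]:
    """Map each name to a single output path, keeping only the first occurrence."""
    planned: dict[str, Path] = {}
    for name in verified_names:
        if name not in planned:  # duplicate prevention: one signature per person
            planned[name] = Path(output_dir) / signature_filename(name)
    return planned


# "PERSON_B" is a placeholder; the repeated name is planned only once.
planned = plan_outputs(["周寶蓮", "PERSON_B", "周寶蓮"], "out")
for name, path in planned.items():
    print(name, "->", path.name)
```

Keying the mapping on the person's name is what makes the duplicate check trivial; a real pipeline might instead keep the crop with the highest verification confidence.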

File Structure

extract_pages_from_csv.py          # Step 1: Extract pages
extract_signatures_hybrid.py       # Step 2: Extract signatures (CURRENT)
README.md                          # This file
PROJECT_DOCUMENTATION.md           # Complete documentation
README_page_extraction.md          # Page extraction guide
README_hybrid_extraction.md        # Signature extraction guide

Requirements

Data

  • Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
  • PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
  • Output: /Volumes/NV2/PDF-Processing/signature-image-output/
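To check that the PDF location above is populated before a run, the batch folders can be enumerated with pathlib. A minimal sketch, assuming the PDFs sit directly inside each batch_* folder and use a lowercase .pdf extension; `list_batch_pdfs` is a hypothetical helper, not a function from this repository:

```python
from pathlib import Path


def list_batch_pdfs(pdf_root: str) -> list[Path]:
    """Return every *.pdf found directly inside batch_* folders under pdf_root, sorted."""
    root = Path(pdf_root)
    # Assumption: one level of batch_* folders, with PDFs directly inside each.
    return sorted(root.glob("batch_*/*.pdf"))
```

Calling `list_batch_pdfs("/Volumes/NV2/PDF-Processing/total-pdf")` and comparing the length of the result against the CSV row count would be a quick sanity check before large-scale testing.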

Status

  • Page extraction: Tested with 100 files, working
  • Signature extraction: Tested with 5 files, 70% recall, 100% precision
  • Large-scale testing: Pending
  • Full dataset (86K files): Pending

See PROJECT_DOCUMENTATION.md for complete details.
