gbanyan c79329457a Phase 6 manuscript splice (1/2): Abstract / §I / §II / §III spliced
Splices v4 drafts into v3.20.0 master sub-files. Drops the
"paper/v4/" working drafts and lands the v4.0 content in the master
file structure. Internal draft notes / close-out checklists / open-
questions blocks stripped at splice (per round-1 through round-6
deferral).

Abstract (paper_a_abstract_v3.md):
- Replaced v3.20.0 abstract (240w) with v4.0 abstract (247w).

§I Introduction (paper_a_introduction_v3.md):
- Replaced v3.20.0 §I with v4.0 §I (16 paragraphs + 8-item
  contributions list).

§II Related Work (paper_a_related_work_v3.md):
- Inserted v4.0 LOOO addition paragraph after the existing
  finite-mixture paragraph; added refs [42]-[44] to the
  internal reference annotation list.

§III Methodology (paper_a_methodology_v3.md):
- §III-A..F (Pipeline / Data / Page ID / Detection / Features /
  Dual Descriptors): kept v3.20.0 content unchanged.
- §III-G..M: replaced v3.20.0 §III-G..K with v4.0 §III-G..M
  (Unit & Scope / Reference Populations / Distributional
  Diagnostics + composition decomposition / K=3 descriptive /
  Convergent internal-consistency / Anchor-based ICCR L.0-L.7 /
  Validation strategy + Table XXVII ten-tool collection).
- §III-N Data Source & Anonymization: kept v3.20.0 §III-L content,
  renumbered to §III-N (after v4 §III-M).
- §III-E ablation cross-reference: updated "§IV-I" -> "§IV-L" to
  match the renumbered §IV.
- §III-F pixel-identity cross-reference: updated "§III-J" ->
  "§III-K".

Gemini round-2 artifact paper/gemini_review_v4_round2.md also
added (was uncommitted from the parallel-review batch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 18:35:53 +08:00

PDF Signature Extraction System

Automated extraction of handwritten Chinese signatures from PDF documents using a hybrid approach that combines a vision-language model (VLM) with computer vision (CV).

Quick Start

Step 1: Extract Pages from CSV

cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py

Step 2: Extract Signatures

python extract_signatures_hybrid.py

Documentation

Current Performance

Test Dataset: 5 PDF pages

  • Signatures expected: 10
  • Signatures found: 7
  • Precision: 100% (no false positives)
  • Recall: 70%
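
The two percentages above follow directly from the counts. As a worked check, assuming all 7 signatures found are correct (which is what "no false positives" implies), the arithmetic can be sketched as follows. The function name `precision_recall` is illustrative and is not part of the repository.

```python
def precision_recall(true_positives: int, false_positives: int, expected: int) -> tuple[float, float]:
    """Return (precision, recall) from raw detection counts."""
    found = true_positives + false_positives
    # Precision: share of reported signatures that are real signatures.
    precision = true_positives / found if found else 0.0
    # Recall: share of expected signatures that were actually reported.
    recall = true_positives / expected if expected else 0.0
    return precision, recall


# Counts from the 5-page test: 10 expected, 7 found, 0 false positives.
precision, recall = precision_recall(true_positives=7, false_positives=0, expected=10)
print(f"Precision: {precision:.0%}, Recall: {recall:.0%}")
```

With only 10 expected signatures, each missed signature moves recall by 10 percentage points, so these figures are best read as a smoke test rather than a benchmark.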

Key Features

  • Hybrid Approach: VLM name extraction + CV detection + VLM verification
  • Name-Based: Signatures saved as signature_周寶蓮.png
  • No False Positives: Name-specific verification filters out dates, text, stamps
  • Duplicate Prevention: Only one signature per person
  • Handles Both: PDFs with/without text layer
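
The name-based naming and duplicate-prevention features can be illustrated with a minimal sketch. It is not the logic of extract_signatures_hybrid.py, which is not shown here: the helper names `signature_filename` and `select_unique` are hypothetical, and keeping the first verified candidate per name is an assumed policy.

```python
def signature_filename(name: str) -> str:
    """Build the output file name for a signer, e.g. signature_周寶蓮.png."""
    return f"signature_{name}.png"


def select_unique(candidates: list[tuple[str, bytes]]) -> dict[str, bytes]:
    """Keep one image per person (assumption: the first verified candidate wins).

    candidates: (signer_name, image_bytes) pairs that already passed verification.
    Returns a mapping from output file name to image bytes.
    """
    kept: dict[str, bytes] = {}
    for name, image in candidates:
        filename = signature_filename(name)
        if filename not in kept:  # later duplicates for the same name are skipped
            kept[filename] = image
    return kept


# Placeholder bytes stand in for cropped signature images.
demo = select_unique([("周寶蓮", b"crop-1"), ("周寶蓮", b"crop-2")])
print(list(demo))
```

Keying the result on the file name means a second crop for the same person can never overwrite or duplicate the first one on disk.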

File Structure

extract_pages_from_csv.py          # Step 1: Extract pages
extract_signatures_hybrid.py       # Step 2: Extract signatures (CURRENT)
README.md                          # This file
PROJECT_DOCUMENTATION.md           # Complete documentation
README_page_extraction.md          # Page extraction guide
README_hybrid_extraction.md        # Signature extraction guide

Requirements

Data

  • Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
  • PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
  • Output: /Volumes/NV2/PDF-Processing/signature-image-output/
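
As a quick sanity check on the data layout above, the batch directories can be enumerated with a short sketch. It assumes that PDFs sit directly inside each batch_* directory and use a lowercase .pdf extension. Neither detail is stated in this README, and the function name `find_pdfs` is illustrative.

```python
from pathlib import Path


def find_pdfs(pdf_root: Path) -> list[Path]:
    """List PDFs under pdf_root/batch_*/ in sorted order.

    Assumption: files live one level below each batch_* directory
    and end in ".pdf"; adjust the pattern if the real layout differs.
    """
    return sorted(pdf_root.glob("batch_*/*.pdf"))
```

Calling `find_pdfs(Path("/Volumes/NV2/PDF-Processing/total-pdf"))` returns an empty list when the volume is not mounted, which makes a missing drive easy to spot before a long run.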

Status

  • Page extraction: Tested with 100 files, working
  • Signature extraction: Tested with 5 files, 70% recall, 100% precision
  • Large-scale testing: Pending
  • Full dataset (86K files): Pending

See PROJECT_DOCUMENTATION.md for complete details.
