gbanyan 1dfbc5f000 Paper A v3.15: resolve Gemini 3.1 Pro round-15 Accept-verdict minor polish
Gemini 3.1 Pro round-15 full-paper review of v3.14 returned Accept
with four MINOR polish suggestions. All four applied in this commit.

1. Table XIII column header: "mean cosine" renamed to
   "mean best-match cosine" to match the underlying metric (per-
   signature best-match over the full same-CPA pool) and prevent
   readers from inferring a simpler per-year statistic.

2. Methodology III-L (L284): added a forward-pointer in the first
   threshold-convention note to Section IV-G.3, explicitly confirming
   that replacing the 0.95 round-number heuristic with the nearby
   accountant-level 2D-GMM marginal crossing 0.945 alters aggregate
   firm-level capture rates by at most ~1.2 percentage points. This
   pre-empts a reader who might worry about the methodological
   tension between the heuristic and the mixture-derived convergence
   band.

3. Results IV-I document-level aggregation (L383): "Document-level
   rates therefore bound the share..." rewritten as "represent the
   share..." Gemini correctly noted that worst-case aggregation
   directly assigns (subject to classifier error), so "bound"
   spuriously implies an inequality not actually present.

4. Results IV-G.4 Sanity Sample (L273): "inter-rater agreement with
   the classifier" rewritten as "full human--classifier agreement
   (30/30)". Inter-rater conventionally refers to human-vs-human
   agreement; human-vs-classifier is the correct term here.

No substantive changes; no tables recomputed.

Gemini round-15 verdict was Accept with these four items framed
as nice-to-have rather than blockers; applying them brings v3.15
to a fully polished state before manual DOCX packaging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 01:01:58 +08:00

PDF Signature Extraction System

Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach.

Quick Start

Step 1: Extract Pages from CSV

cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py

Step 2: Extract Signatures

python extract_signatures_hybrid.py

Documentation

Current Performance

Test Dataset: 5 PDF pages

  • Signatures expected: 10
  • Signatures found: 7
  • Precision: 100% (no false positives)
  • Recall: 70%

Key Features

Hybrid Approach: VLM name extraction + CV detection + VLM verification Name-Based: Signatures saved as signature_周寶蓮.png No False Positives: Name-specific verification filters out dates, text, stamps Duplicate Prevention: Only one signature per person Handles Both: PDFs with/without text layer

File Structure

extract_pages_from_csv.py          # Step 1: Extract pages
extract_signatures_hybrid.py       # Step 2: Extract signatures (CURRENT)
README.md                          # This file
PROJECT_DOCUMENTATION.md           # Complete documentation
README_page_extraction.md          # Page extraction guide
README_hybrid_extraction.md        # Signature extraction guide

Requirements

Data

  • Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
  • PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
  • Output: /Volumes/NV2/PDF-Processing/signature-image-output/

Status

Page extraction: Tested with 100 files, working Signature extraction: Tested with 5 files, 70% recall, 100% precision Large-scale testing: Pending Full dataset (86K files): Pending

See PROJECT_DOCUMENTATION.md for complete details.

S
Description
Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach. 70% recall, 100% precision.
Readme 3.7 MiB
Languages
Python 100%