gbanyan 4ee2efb5bb Add codex GPT-5.5 Phase 5 round-2 cross-check on post-round-2 drafts
Verdict: Minor Revision (corroborates Gemini round-1 and Opus round-1).

Round-1 panel finding closure (codex round-8 audit):
- Codex own round-7: 11 Major + 15 Minor → 21 CLOSED, 4 OPEN/PARTIAL
  (mostly splice items); M6 + new-issue-1 (refs [42]-[44]) SUPERSEDED
  (Gemini was right, codex round-7 was wrong about absence)
- Gemini round-1: 5 Major + 3 Minor all CLOSED in main body
- Opus round-1: M1-M4 CLOSED in manuscript body; some minors open

Provenance verification (independent of Opus):
- Within-firm any-pair from Table XXV: 98.8032 / 76.6529 / 83.7079 /
  77.3723% — Opus arithmetic confirmed
- Same-pair joint: 99.9558 / 97.7011 / 98.1818 / 96.9697% — confirms
  the 97.0-99.96% range
- Pooled Big-4 any-pair ICCR 0.1102 verified from Script 43 report
  (16,578 / 150,453); Wilson 95% half-width 0.00158 reconciles
- Per-pair conditional ICCR 0.234 verified from Script 40b (70 / 299)

Round-2-induced / round-2-exposed concrete blockers (fixable):
1. Abstract now 261 words (M3 fix pushed over <=250 IEEE Access target);
   need 11+ word trim
2. §IV line 177 footnote miscategorizes §IV-M.5 as n=150,442 —
   §IV-M.5 / Tables XXIV-XXV actually use 150,453 vector-complete per
   Script 44 report; only §IV-D through §IV-J use 150,442
3. §IV-I line 161 stale cross-reference: "§IV-M Table XVI" should be
   "§IV-M Tables XXI-XXVI" — XVI is the K=3 firm cross-tab,
   pre-existing error exposed by the cascade

Minor copy-edit residue (not blockers): §III line 131 + §IV Table XI
line 104 "replicated vs not-replicated" binary-collapse label;
internal-note staleness at §III lines 438/445, §IV lines 3/370.

No empirical reopening: codex confirms Opus M3 does not invalidate
round-7's Major closures of M2 (Big-4 scope) or M11 (cross-scope
reproducibility). Only round-7 minor reopened: m2 abstract margin.

Phase 5 readiness: Partial — empirical core ready, no new statistical
work required; copy-edit / factual-reference splice blockers remain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 17:15:42 +08:00

PDF Signature Extraction System

Automated extraction of handwritten Chinese signatures from PDF documents, using a hybrid VLM + Computer Vision approach.

Quick Start

Step 1: Extract Pages from CSV

cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py
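The script extract_pages_from_csv.py is not reproduced here; a minimal sketch of the kind of grouping logic Step 1 might use is shown below. The column names `pdf_path` and `page` are hypothetical assumptions — the real master_signatures.csv schema may differ.

```python
from collections import defaultdict

def pages_by_pdf(rows):
    """Group the page numbers to extract under each source PDF.

    `rows` are dicts such as csv.DictReader yields; the keys
    'pdf_path' and 'page' are assumed names, not the verified
    master_signatures.csv schema.
    """
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["pdf_path"]].add(int(row["page"]))
    # Sorted, de-duplicated page lists so each PDF is opened once.
    return {pdf: sorted(pages) for pdf, pages in grouped.items()}
```

Rendering each listed page to an image (e.g. via PyMuPDF) would then happen once per PDF rather than once per CSV row, which matters at the 86,073-row scale noted below.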

Step 2: Extract Signatures

python extract_signatures_hybrid.py

Documentation

Current Performance

Test Dataset: 5 PDF pages

  • Signatures expected: 10
  • Signatures found: 7
  • Precision: 100% (no false positives)
  • Recall: 70%
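The precision and recall figures above follow directly from the reported counts; a minimal check using only the numbers stated here:

```python
def precision_recall(found, expected, false_positives):
    """Precision = TP / (TP + FP); recall = TP / expected positives."""
    tp = found - false_positives
    precision = tp / (tp + false_positives)
    recall = tp / expected
    return precision, recall

# 7 detections, all correct (0 false positives), out of 10 expected:
p, r = precision_recall(found=7, expected=10, false_positives=0)
# p -> 1.0 (100% precision), r -> 0.7 (70% recall)
```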

Key Features

  • Hybrid Approach: VLM name extraction + CV detection + VLM verification
  • Name-Based: Signatures saved as signature_周寶蓮.png
  • No False Positives: Name-specific verification filters out dates, text, and stamps
  • Duplicate Prevention: Only one signature per person
  • Handles Both: PDFs with and without a text layer
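The three-stage pipeline described above could be sketched as follows. The function names, injected callables, and return shapes are illustrative assumptions, not the actual extract_signatures_hybrid.py API:

```python
def extract_signatures(page_image, extract_names, detect_regions, verify):
    """Hybrid pipeline sketch: VLM name extraction -> CV candidate
    detection -> per-name VLM verification, one signature per person.

    `extract_names`, `detect_regions`, and `verify` are stand-ins for
    the real VLM/CV components (hypothetical interfaces).
    """
    results = {}
    names = extract_names(page_image)      # VLM: who is expected to have signed?
    regions = detect_regions(page_image)   # CV: candidate handwriting crops
    for name in names:
        if name in results:                # duplicate prevention: one per person
            continue
        for region in regions:
            if verify(region, name):       # VLM: is this crop `name`'s signature?
                results[name] = region     # would be saved as signature_<name>.png
                break
    return results
```

The name-specific verification step is what yields the zero-false-positive behaviour: a crop is only accepted when the verifier matches it to a specific expected name, so dates, printed text, and stamps are rejected.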

File Structure

extract_pages_from_csv.py          # Step 1: Extract pages
extract_signatures_hybrid.py       # Step 2: Extract signatures (CURRENT)
README.md                          # This file
PROJECT_DOCUMENTATION.md           # Complete documentation
README_page_extraction.md          # Page extraction guide
README_hybrid_extraction.md        # Signature extraction guide

Requirements

Data

  • Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
  • PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
  • Output: /Volumes/NV2/PDF-Processing/signature-image-output/
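Given the batch_*/ layout described above, collecting the input PDFs is a one-line glob; a minimal sketch (the directory layout is taken from this README, everything else is standard library):

```python
from pathlib import Path

def find_batch_pdfs(root):
    """Collect PDFs under <root>/batch_*/ in a stable, sorted order."""
    return sorted(Path(root).glob("batch_*/*.pdf"))
```

For the full 86K-file run, iterating this lazily (Path.glob is a generator before `sorted`) keeps memory flat regardless of batch count.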

Status

  • Page extraction: tested with 100 files, working
  • Signature extraction: tested with 5 files, 70% recall, 100% precision
  • Large-scale testing: pending
  • Full dataset (86K files): pending

See PROJECT_DOCUMENTATION.md for complete details.
