d2f8673a67a6a837f6f5c5f56ea6f64e8e76ba11
Rewrites Section III-G (Unit of Analysis and Summary Statistics) after self-review identified three logical issues in v3.10:

1. Ordering inversion: the three units are now ordered signature -> auditor-year -> accountant, with auditor-year as the principled middle unit under within-year assumptions and accountant as a deliberate cross-year pooling.
2. Oversold assumption: the old "within-auditor-year no-mixing identification assumption" is split into A1 (pair-detectability, weak statistical, cross-year scope matching the detector) and A2 (within-year label uniformity, interpretive convention). The arithmetic statistics reported in the paper do not require A2; A2 only underwrites interpretive readings (notably IV-H.1's partner-level "minority of hand-signers" framing).
3. Motivation-assumption mismatch: removed the "longitudinal behaviour of interest" framing and explicitly disclaimed across-year homogeneity. Accountant-level coordinates are now described as a pooled observed tendency rather than a time-invariant regime.

Propagated implications across Introduction, Discussion, and Results: softened "tends to cluster into a dominant regime" and "directly quantifying the minority of hand-signers" to "pooled observed tendency" / "consistent with within-firm heterogeneity"; rewrote the Limitations fifth point (was "treats all signatures from a CPA as a single class"); added a seventh Limitation acknowledging the source-template edge case; added a per-signature best-match cross-year caveat to Section IV-H.2; softened IV-H.2's "direct consequence" to "consistent with"; reframed pixel-identity anchor as pair-level proof of image reuse (with source-template exception) rather than absolute signature-level positive.

Process: self-review (9 findings) -> full-pass fixes -> codex gpt-5.5 xhigh round-10 verification (8 RESOLVED, 1 PARTIAL, 4 MINOR regression findings) -> regression fixes.

No re-computation. All tables (IV-XVIII) and Appendix A numbers unchanged. Abstract at 248/250 words.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PDF Signature Extraction System
Automated extraction of handwritten Chinese signatures from PDF documents using a hybrid approach that combines a vision-language model (VLM) with computer vision (CV).
Quick Start
Step 1: Extract Pages from CSV
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py
Step 2: Extract Signatures
python extract_signatures_hybrid.py
Documentation
- PROJECT_DOCUMENTATION.md - Complete project history, all approaches tested, detailed results
- README_page_extraction.md - Page extraction documentation
- README_hybrid_extraction.md - Hybrid signature extraction documentation
Current Performance
Test Dataset: 5 PDF pages
- Signatures expected: 10
- Signatures found: 7
- Precision: 100% (no false positives)
- Recall: 70%
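The two percentages follow directly from the counts listed above. A minimal sketch of the arithmetic, assuming all 7 detections are correct (which is what "no false positives" implies); the function name is illustrative and not taken from the project scripts:

```python
def precision_recall(true_positives, false_positives, expected):
    """Return (precision, recall) from raw detection counts."""
    detected = true_positives + false_positives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / expected if expected else 0.0
    return precision, recall


# Counts from the 5-page test set: 7 found, 0 false positives, 10 expected.
precision, recall = precision_recall(true_positives=7, false_positives=0, expected=10)
print(f"Precision: {precision:.0%}, Recall: {recall:.0%}")
```

With only 10 expected signatures, each missed or recovered signature moves recall by 10 percentage points, so these figures are best read as a smoke test rather than a stable estimate.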
Key Features
✅ Hybrid Approach: VLM name extraction + CV detection + VLM verification
✅ Name-Based: Signatures saved under the signer's name, e.g. signature_周寶蓮.png
✅ No False Positives (on the 5-page test set): Name-specific verification filters out dates, text, and stamps
✅ Duplicate Prevention: Only one signature saved per person
✅ Handles Both: PDFs with and without a text layer
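The name-based output and duplicate-prevention features above can be illustrated with a short sketch. This is not the code in extract_signatures_hybrid.py: the `(name, verified)` candidate structure and both function names are assumptions made for illustration, and "keep the first verified candidate" is one plausible rule among several.

```python
def signature_filename(name):
    """Build the output filename for a signer, e.g. signature_周寶蓮.png."""
    return f"signature_{name.strip()}.png"


def select_signatures(candidates):
    """Keep at most one verified candidate per name.

    candidates: iterable of (name, verified) tuples (hypothetical structure),
    where `verified` stands in for the result of the VLM verification step.
    """
    seen = set()
    kept = []
    for name, verified in candidates:
        if not verified or name in seen:
            continue  # skip rejected regions and repeated signers
        seen.add(name)
        kept.append(signature_filename(name))
    return kept


files = select_signatures([
    ("周寶蓮", True),     # first verified match is kept
    ("周寶蓮", True),     # duplicate for the same person is dropped
    ("Signer B", False),  # placeholder name; failed verification (e.g. a stamp)
])
print(files)
```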
File Structure
extract_pages_from_csv.py # Step 1: Extract pages
extract_signatures_hybrid.py # Step 2: Extract signatures (CURRENT)
README.md # This file
PROJECT_DOCUMENTATION.md # Complete documentation
README_page_extraction.md # Page extraction guide
README_hybrid_extraction.md # Signature extraction guide
Requirements
- Python 3.9+
- PyMuPDF, OpenCV, NumPy, Requests
- Ollama with qwen2.5vl:32b model
- Ollama instance: http://192.168.30.36:11434
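For orientation, a request to the Ollama instance might look like the sketch below. It uses Ollama's `/api/generate` endpoint with a base64-encoded image; the helper names, the prompt, and the timeout are assumptions, and the project's actual prompts and request handling are not shown here.

```python
import base64

OLLAMA_URL = "http://192.168.30.36:11434"  # instance listed in Requirements
MODEL = "qwen2.5vl:32b"                    # model listed in Requirements


def build_generate_request(image_bytes, prompt):
    """Build a non-streaming /api/generate payload with one attached image."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ask_vlm(image_bytes, prompt, timeout=120):
    """Send the image and prompt to Ollama and return the model's text reply."""
    import requests  # third-party; listed in Requirements

    response = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json=build_generate_request(image_bytes, prompt),
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json()["response"]


# Building the payload needs no network access, so it can be inspected offline.
payload = build_generate_request(b"abc", "List the signer names on this page.")
```

Keeping payload construction separate from the HTTP call makes the request easy to check without a running Ollama server.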
Data
- Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
- PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
- Output: /Volumes/NV2/PDF-Processing/signature-image-output/
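Because the PDFs are spread across several batch_* directories, a lookup by filename has to search all of them. A minimal sketch, assuming each PDF sits directly inside one batch directory; `find_pdf` is a hypothetical helper, not part of the project scripts:

```python
from pathlib import Path


def find_pdf(pdf_root, filename):
    """Return the first match for `filename` under pdf_root/batch_*/, or None.

    Assumes `filename` contains no glob metacharacters such as * or [.
    """
    matches = sorted(Path(pdf_root).glob(f"batch_*/{filename}"))
    return matches[0] if matches else None
```

For example, `find_pdf("/Volumes/NV2/PDF-Processing/total-pdf", "example.pdf")` returns a `Path` if any batch directory contains a file with that (placeholder) name, and `None` otherwise.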
Status
✅ Page extraction: Tested with 100 files, working
✅ Signature extraction: Tested with 5 files, 70% recall, 100% precision
⏳ Large-scale testing: Pending
⏳ Full dataset (86K files): Pending
See PROJECT_DOCUMENTATION.md for complete details.