453f1d87687c2b897930d5998c12238f86430fa8
Script 42 tabulates the §III-L five-way per-signature classifier
output on the Big-4 sub-corpus (n=150,442 signatures classified)
and aggregates to document-level (n=75,233 unique PDFs) under
the worst-case rule.
Per-signature five-way overall (Table XV):
HC 74,593 49.58% high-confidence non-hand-signed
MC 39,817 26.47% moderate-confidence non-hand-signed
HSC 314 0.21% high style consistency
UN 35,480 23.58% uncertain
LH 238 0.16% likely hand-signed
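The percentages above follow directly from the counts and the n=150,442 total. A minimal Python sketch of that arithmetic is given here as a cross-check; the counts are copied from Table XV, while the variable and function names are illustrative and not taken from Script 42:

```python
# Per-signature counts copied from Table XV; names are illustrative only.
PER_SIGNATURE_COUNTS = {
    "HC": 74_593,
    "MC": 39_817,
    "HSC": 314,
    "UN": 35_480,
    "LH": 238,
}


def percentages(counts):
    """Return each category's share of the total as a percentage (2 decimals)."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


if __name__ == "__main__":
    print("total:", sum(PER_SIGNATURE_COUNTS.values()))
    for label, pct in percentages(PER_SIGNATURE_COUNTS).items():
        print(f"{label:<4}{pct:6.2f}%")
```

The same function can be applied to the document-level counts to check Table XV-B against its 75,233 total.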
Per-firm five-way (% within firm):
Firm A (Deloitte) HC 81.70%, MC 10.76%, UN 7.42%
Firm B (KPMG) HC 34.56%, MC 35.88%, UN 29.09%
Firm C (PwC) HC 23.75%, MC 41.44%, UN 34.21%
Firm D (EY) HC 24.51%, MC 29.33%, UN 45.65%
Document-level (Table XV-B, NEW):
HC 46,857 62.28%
MC 19,667 26.14%
HSC 167 0.22%
UN 8,524 11.33%
LH 18 0.02%
Total 75,233 unique Big-4 PDFs (single-firm 74,854; mixed-firm 379)
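The worst-case aggregation behind these document-level counts might be sketched as follows. This is not Script 42 itself: the severity order HC > MC > HSC > UN > LH is an assumption inferred from the category names and from the shift between the two tables, and the `(pdf_id, label)` row shape is hypothetical.

```python
from collections import Counter

# ASSUMPTION: severity order, most to least indicative of non-hand-signing.
# The commit message names the categories but does not spell out this ranking.
SEVERITY = ["HC", "MC", "HSC", "UN", "LH"]
RANK = {label: i for i, label in enumerate(SEVERITY)}


def worst_case_label(labels):
    """Return the most severe label among one document's signatures."""
    return min(labels, key=RANK.__getitem__)


def aggregate_documents(signature_rows):
    """Collapse (pdf_id, label) pairs to one worst-case label per PDF."""
    per_doc = {}
    for pdf_id, label in signature_rows:
        current = per_doc.get(pdf_id)
        if current is None or RANK[label] < RANK[current]:
            per_doc[pdf_id] = label
    return per_doc


if __name__ == "__main__":
    # Toy rows for illustration; not data from the corpus.
    rows = [("a.pdf", "UN"), ("a.pdf", "HC"), ("b.pdf", "LH"), ("b.pdf", "UN")]
    docs = aggregate_documents(rows)
    print(docs)
    print(Counter(docs.values()))
```

Under this ordering a document is labeled LH only when every one of its signatures is LH, which would be in line with LH falling from 0.16% of signatures to 0.02% of documents.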
§IV v2 changes vs v1:
- Table XV populated with Script 42 counts
- Table XV-B (NEW): document-level worst-case counts
- Per-firm five-way breakdown (% within firm) added
- Per-firm document-level breakdown added
- Document-level paragraph in §IV-J updated to reference Table XV-B
- Phase 3 close-out checklist: item 1 (Table XV TBD) and item 4
(document-level counts) marked RESOLVED; remaining items reduced
from 5 to 3 (renumbering, content audit, codex open-questions)
The per-firm pattern is consistent with the §III-K Spearman-and-cluster
ordering. Firm A's signatures concentrate in HC (81.7%), whereas the
three non-Firm-A firms have markedly lower HC rates (24-35%) and
substantially higher Uncertain rates (29-46%). Firm D has the highest
Uncertain rate of the Big-4 (45.65%), consistent with the reverse-anchor
score (§III-K Score 2) ranking Firm D fractionally above Firm C in
the hand-leaning direction.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PDF Signature Extraction System
Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach.
Quick Start
Step 1: Extract Pages from CSV
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py
Step 2: Extract Signatures
python extract_signatures_hybrid.py
Documentation
- PROJECT_DOCUMENTATION.md - Complete project history, all approaches tested, detailed results
- README_page_extraction.md - Page extraction documentation
- README_hybrid_extraction.md - Hybrid signature extraction documentation
Current Performance
Test Dataset: 5 PDF pages
- Signatures expected: 10
- Signatures found: 7
- Precision: 100% (no false positives)
- Recall: 70%
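The two figures above follow from the counts: all 7 extracted signatures were correct (no false positives), and 7 of the 10 expected signatures were found. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def precision_recall(found, expected, false_positives=0):
    """Compute (precision, recall) from raw counts.

    found: number of signatures the pipeline produced
    expected: number of signatures actually present
    false_positives: produced items that are not real signatures
    """
    true_positives = found - false_positives
    precision = true_positives / found if found else 0.0
    recall = true_positives / expected if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    precision, recall = precision_recall(found=7, expected=10)
    print(f"precision={precision:.0%} recall={recall:.0%}")
```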
Key Features
✅ Hybrid Approach: VLM name extraction + CV detection + VLM verification
✅ Name-Based: Signatures saved as signature_周寶蓮.png
✅ No False Positives: Name-specific verification filters out dates, text, stamps
✅ Duplicate Prevention: Only one signature per person
✅ Handles Both: PDFs with/without text layer
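The name-based output and duplicate-prevention features could be sketched roughly as below. This is an illustration rather than the code in extract_signatures_hybrid.py: the function names are hypothetical, and keeping the first verified candidate per name is an assumption, since the README does not say which one is kept.

```python
def signature_filename(name):
    """Build the output filename for a signer, e.g. signature_周寶蓮.png."""
    return f"signature_{name}.png"


def keep_one_per_person(candidates):
    """Keep a single image per signer name.

    candidates: iterable of (name, image_bytes) pairs that already passed
    verification. ASSUMPTION: the first candidate seen for a name wins.
    Returns a dict mapping output filename to image bytes.
    """
    kept = {}
    for name, image in candidates:
        kept.setdefault(signature_filename(name), image)
    return kept


if __name__ == "__main__":
    verified = [("周寶蓮", b"first"), ("周寶蓮", b"second")]
    print(list(keep_one_per_person(verified)))
```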
File Structure
extract_pages_from_csv.py # Step 1: Extract pages
extract_signatures_hybrid.py # Step 2: Extract signatures (CURRENT)
README.md # This file
PROJECT_DOCUMENTATION.md # Complete documentation
README_page_extraction.md # Page extraction guide
README_hybrid_extraction.md # Signature extraction guide
Requirements
- Python 3.9+
- PyMuPDF, OpenCV, NumPy, Requests
- Ollama with qwen2.5vl:32b model
- Ollama instance: http://192.168.30.36:11434
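For orientation, a request to the Ollama instance listed above might look like the following sketch. It uses Ollama's `/api/generate` endpoint with a base64-encoded image and only the standard library (the project itself lists Requests). The function names and any prompt text passed in are hypothetical and are not taken from the project scripts.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://192.168.30.36:11434"  # instance from the Requirements list
MODEL = "qwen2.5vl:32b"                    # model from the Requirements list


def build_payload(image_bytes, prompt):
    """Build a non-streaming /api/generate request body with one image."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ask_vlm(image_bytes, prompt, timeout=120):
    """POST the image and prompt to Ollama and return the model's text reply."""
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(build_payload(image_bytes, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read())["response"]
```

A caller would read a page image from disk and pass its bytes to `ask_vlm` together with a prompt; the timeout of 120 seconds is an arbitrary placeholder.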
Data
- Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
- PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
- Output: /Volumes/NV2/PDF-Processing/signature-image-output/
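Because the PDFs are spread across batch_* directories, locating one file means searching every batch. A minimal sketch with `pathlib`, assuming each PDF sits directly inside a batch directory; the helper name `find_pdf` is hypothetical:

```python
from pathlib import Path

PDF_ROOT = Path("/Volumes/NV2/PDF-Processing/total-pdf")  # from the Data list


def find_pdf(filename, root=PDF_ROOT):
    """Return the first batch_*/<filename> match under root, or None.

    ASSUMPTION: PDFs sit directly inside batch_* directories and filenames
    contain no glob metacharacters such as '*' or '['.
    """
    matches = sorted(Path(root).glob(f"batch_*/{filename}"))
    return matches[0] if matches else None
```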
Status
✅ Page extraction: Tested with 100 files, working
✅ Signature extraction: Tested with 5 files, 70% recall, 100% precision
⏳ Large-scale testing: Pending
⏳ Full dataset (86K files): Pending
See PROJECT_DOCUMENTATION.md for complete details.