e1d81e3732c3f64d15eea7ddc462be7a34dfb1fb
Spike for the from-outside-of-firmA branch.

Runs the three-method threshold framework (KDE+dip, BD/McCrary, Beta mixture / logit-GMM, 2D-GMM) on three subsets:

- Subset I (big4_non_A): KPMG+PwC+EY pooled (266 CPAs, 89.9k sigs)
- Subset II (all_non_A): every firm except Firm A (515 CPAs, 108k sigs)
- Subset III (firm_A): reference baseline (171 CPAs, 60.4k sigs)

Plus a pre_2018 / post_2020 time-stratified secondary analysis on subsets I and II.

Result: verdict C -- every subset is unimodal at the dip-test level (dip p > 0.76 across the board), including Firm A itself. Time stratification does not recover bimodality.

Cross-subset Beta-2 cosine crossings: Firm A 0.977, big4_non_A 0.930, all_non_A 0.938. Paper A's published 0.945 sits between the two mass centers, indicating that the published "natural threshold" is effectively a between-firm separator rather than a within-pool mechanism boundary. This finding motivates a follow-up reverse-anchor spike (script 33).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PDF Signature Extraction System
Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach.
Quick Start
Step 1: Extract Pages from CSV
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py
Step 2: Extract Signatures
python extract_signatures_hybrid.py
Documentation
- PROJECT_DOCUMENTATION.md - Complete project history, all approaches tested, detailed results
- README_page_extraction.md - Page extraction documentation
- README_hybrid_extraction.md - Hybrid signature extraction documentation
Current Performance
Test Dataset: 5 PDF pages
- Signatures expected: 10
- Signatures found: 7
- Precision: 100% (no false positives)
- Recall: 70%
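The precision and recall figures above follow directly from the counts. A minimal sketch of the arithmetic, assuming all 7 found signatures are true positives (consistent with the "no false positives" note); the function name is illustrative and not part of the project scripts:

```python
def precision_recall(true_positives: int, false_positives: int, expected: int) -> tuple[float, float]:
    """Return (precision, recall) from raw detection counts."""
    detected = true_positives + false_positives
    # Precision: share of detections that are real signatures.
    precision = true_positives / detected if detected else 0.0
    # Recall: share of expected signatures that were found.
    recall = true_positives / expected if expected else 0.0
    return precision, recall


# Counts from the 5-page test set: 7 found, 0 false positives, 10 expected.
precision, recall = precision_recall(true_positives=7, false_positives=0, expected=10)
print(f"Precision: {precision:.0%}, Recall: {recall:.0%}")
```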
Key Features
✅ Hybrid Approach: VLM name extraction + CV detection + VLM verification
✅ Name-Based: Signatures saved as signature_周寶蓮.png
✅ No False Positives: Name-specific verification filters out dates, text, stamps
✅ Duplicate Prevention: Only one signature per person
✅ Handles Both: PDFs with/without text layer
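The name-based saving and duplicate prevention features can be sketched as follows. This is an illustration only: the helper names are hypothetical, and keeping the first verified candidate per name is an assumption, since this README does not say which candidate extract_signatures_hybrid.py keeps.

```python
def signature_filename(name: str) -> str:
    """Build the output filename in the signature_<name>.png pattern."""
    return f"signature_{name}.png"


def select_unique(candidates: list[tuple[str, str]]) -> dict[str, str]:
    """Keep one candidate per person.

    `candidates` is a list of (person_name, crop_id) pairs that have
    already passed name-specific verification. Assumption: the first
    verified candidate for each name wins; later ones are dropped.
    """
    kept: dict[str, str] = {}
    for name, crop_id in candidates:
        if name not in kept:
            kept[name] = crop_id
    return kept


verified = [("周寶蓮", "page1_region3"), ("周寶蓮", "page1_region7")]
for name, crop_id in select_unique(verified).items():
    print(crop_id, "->", signature_filename(name))
```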
File Structure
extract_pages_from_csv.py # Step 1: Extract pages
extract_signatures_hybrid.py # Step 2: Extract signatures (CURRENT)
README.md # This file
PROJECT_DOCUMENTATION.md # Complete documentation
README_page_extraction.md # Page extraction guide
README_hybrid_extraction.md # Signature extraction guide
Requirements
- Python 3.9+
- PyMuPDF, OpenCV, NumPy, Requests
- Ollama with qwen2.5vl:32b model
- Ollama instance: http://192.168.30.36:11434
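To check that the Ollama instance and model listed above are reachable, a request to Ollama's `/api/generate` endpoint might look like the sketch below. The function names, the prompt and the timeout are assumptions for illustration; the project scripts may call the model differently.

```python
import base64

OLLAMA_URL = "http://192.168.30.36:11434"  # instance listed in Requirements
MODEL = "qwen2.5vl:32b"


def build_generate_payload(image_bytes: bytes, prompt: str) -> dict:
    """Build a non-streaming /api/generate request body with one image."""
    return {
        "model": MODEL,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ask_vlm(image_bytes: bytes, prompt: str, timeout: float = 120.0) -> str:
    """Send one image plus prompt to the VLM and return its text reply."""
    import requests  # imported lazily so the payload helper works without it

    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json=build_generate_payload(image_bytes, prompt),
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```

A call such as `ask_vlm(open("page.png", "rb").read(), "List the names of the signers on this page.")` would then return the model's reply as plain text; the file name and prompt here are placeholders.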
Data
- Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
- PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
- Output: /Volumes/NV2/PDF-Processing/signature-image-output/
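Before a large run it can help to sanity-check the data locations above. The sketch below counts CSV rows and indexes the PDFs found under the `batch_*` directories. It assumes the CSV is UTF-8 with a single header row and that the PDFs sit directly inside each batch directory with a `.pdf` extension; none of this is confirmed by this README, and the helper names are hypothetical.

```python
import csv
from pathlib import Path


def count_rows(csv_path: Path) -> int:
    """Count data rows in the CSV, skipping an assumed header row."""
    with csv_path.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # assumption: first row is a header
        return sum(1 for _ in reader)


def index_pdfs(pdf_root: Path) -> dict[str, Path]:
    """Map PDF filename -> full path across all batch_* directories."""
    index: dict[str, Path] = {}
    for batch_dir in sorted(pdf_root.glob("batch_*")):
        for pdf in sorted(batch_dir.glob("*.pdf")):
            index[pdf.name] = pdf
    return index


if __name__ == "__main__":
    base = Path("/Volumes/NV2/PDF-Processing")
    print("CSV rows:", count_rows(base / "master_signatures.csv"))
    print("PDFs indexed:", len(index_pdfs(base / "total-pdf")))
```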
Status
✅ Page extraction: Tested with 100 files, working
✅ Signature extraction: Tested with 5 files, 70% recall, 100% precision
⏳ Large-scale testing: Pending
⏳ Full dataset (86K files): Pending
See PROJECT_DOCUMENTATION.md for complete details.