92f1db831a65b496388863e50619147ea09db7cc
Follow-up to Script 36's K=2 UNSTABLE finding. Tests whether K=3's C1 hand-leaning component (~14% weight, cos~0.946, dh~9.17 from Script 35) is firm-mass driven or a real cross-firm sub-population.

Result: C1 component shape IS stable across LOOO folds.

Fold       C1 cos   C1 dh     C1 weight
baseline   0.9457   9.1715    0.143
-FirmA     0.9425   10.1263   0.145
-KPMG      0.9441   9.1591    0.127
-PwC       0.9504   8.4068    0.126
-EY        0.9439   9.2897    0.120

Max drift vs baseline: cos 0.0047, dh 0.955, weight 0.023 -- all within heuristic stability bars (0.01, 1.0, 0.10).

Held-out prediction divergence vs Script 35 baseline:
Firm A   predicted 4.68%  vs baseline 0.0%   (+4.68 pp)
KPMG     predicted 7.14%  vs baseline 8.9%   (-1.76 pp)
PwC      predicted 36.27% vs baseline 23.5%  (+12.77 pp)
EY       predicted 17.31% vs baseline 11.5%  (+5.81 pp)

Verdict: P2_PARTIAL.

Methodological insight: K=3 disentangles the firm-mass/mechanism confound that broke K=2. C3 (cos~0.983, dh~2.4) absorbs Firm A's templated mass; C1 (cos~0.946, dh~9.17) captures cross-firm hand-leaning. Membership boundary shifts slightly (±5-13 pp) across folds, reflecting honest calibration uncertainty rather than collapse.

Implication: v4.0 can pivot to a "characterized cluster structure with bounded reproducibility" framing instead of the original "clean natural threshold" pitch. Honest, defensible, but a different paper than v3.20.0 was building.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
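The drift figures reported above can be re-derived from the per-fold statistics. The following is a minimal sketch, not the project's actual script: the names `max_drift` and `is_stable` are hypothetical, and it assumes "within the bar" means less than or equal to the bar.

```python
# Per-fold statistics for component C1, copied from the commit message:
# (cos, dh, weight).
BASELINE = (0.9457, 9.1715, 0.143)
FOLDS = {
    "-FirmA": (0.9425, 10.1263, 0.145),
    "-KPMG": (0.9441, 9.1591, 0.127),
    "-PwC": (0.9504, 8.4068, 0.126),
    "-EY": (0.9439, 9.2897, 0.120),
}
# Heuristic stability bars quoted in the message, in the same order.
BARS = (0.01, 1.0, 0.10)


def max_drift(baseline, folds):
    """Largest absolute deviation from the baseline, per statistic."""
    return tuple(
        max(abs(stats[i] - baseline[i]) for stats in folds.values())
        for i in range(len(baseline))
    )


def is_stable(drift, bars):
    """True when every drift is at or below its bar (assumed comparison)."""
    return all(d <= b for d, b in zip(drift, bars))


drift = max_drift(BASELINE, FOLDS)
print([round(d, 4) for d in drift])
print(is_stable(drift, BARS))
```

The largest cos drift comes from the -PwC fold, the largest dh drift from -FirmA, and the largest weight drift from -EY.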
PDF Signature Extraction System
Automated extraction of handwritten Chinese signatures from PDF documents using a hybrid VLM + computer vision approach.
Quick Start
Step 1: Extract Pages from CSV
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py
Step 2: Extract Signatures
python extract_signatures_hybrid.py
Documentation
- PROJECT_DOCUMENTATION.md - Complete project history, all approaches tested, detailed results
- README_page_extraction.md - Page extraction documentation
- README_hybrid_extraction.md - Hybrid signature extraction documentation
Current Performance
Test Dataset: 5 PDF pages
- Signatures expected: 10
- Signatures found: 7
- Precision: 100% (no false positives)
- Recall: 70%
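The precision and recall figures follow directly from the counts above. As a small worked example (the function name `precision_recall` is hypothetical, and the false-positive count of zero is taken from the "no false positives" note):

```python
def precision_recall(expected, found, false_positives=0):
    """Compute precision and recall from raw detection counts."""
    true_positives = found - false_positives
    precision = true_positives / found if found else 0.0
    recall = true_positives / expected if expected else 0.0
    return precision, recall


# Counts from the 5-page test set: 10 expected, 7 found, 0 false positives.
precision, recall = precision_recall(expected=10, found=7)
print(f"precision={precision:.0%} recall={recall:.0%}")
```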
Key Features
✅ Hybrid Approach: VLM name extraction + CV detection + VLM verification
✅ Name-Based: Signatures saved as signature_周寶蓮.png
✅ No False Positives (on the 5-page test set): Name-specific verification filters out dates, text, stamps
✅ Duplicate Prevention: Only one signature per person
✅ Handles Both: PDFs with or without a text layer
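The name-based naming and duplicate-prevention features can be illustrated with a short sketch. This is not the code in extract_signatures_hybrid.py: the helpers `signature_filename` and `plan_outputs`, the "first match wins" rule, the second name, and the bounding boxes are all assumptions made for illustration.

```python
def signature_filename(name):
    """Build the output filename for a signer, e.g. signature_周寶蓮.png."""
    return f"signature_{name}.png"


def plan_outputs(detections):
    """Map each filename to one region, keeping the first match per name.

    `detections` is an iterable of (name, region) pairs; which match should
    win is an assumption here, not documented behaviour.
    """
    outputs = {}
    for name, region in detections:
        filename = signature_filename(name)
        if filename not in outputs:
            outputs[filename] = region
    return outputs


# Hypothetical detections: (x, y, width, height) boxes are made up.
detections = [
    ("周寶蓮", (120, 640, 200, 80)),
    ("周寶蓮", (125, 642, 198, 79)),  # duplicate for the same person
    ("王小明", (400, 640, 190, 75)),  # placeholder second name
]
for filename, region in plan_outputs(detections).items():
    print(filename, region)
```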
File Structure
extract_pages_from_csv.py # Step 1: Extract pages
extract_signatures_hybrid.py # Step 2: Extract signatures (CURRENT)
README.md # This file
PROJECT_DOCUMENTATION.md # Complete documentation
README_page_extraction.md # Page extraction guide
README_hybrid_extraction.md # Signature extraction guide
Requirements
- Python 3.9+
- PyMuPDF, OpenCV, NumPy, Requests
- Ollama with qwen2.5vl:32b model
- Ollama instance: http://192.168.30.36:11434
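Before running the extraction it can help to confirm that the Ollama instance is reachable and serves the required model. The sketch below uses Ollama's `/api/tags` endpoint, which lists locally available models; the helper names `fetch_tags` and `has_model` are hypothetical, and the sample payload is hand-written to keep the demonstration offline.

```python
import json
import urllib.request

OLLAMA_URL = "http://192.168.30.36:11434"  # instance listed above
MODEL = "qwen2.5vl:32b"                    # model listed above


def fetch_tags(base_url, timeout=5.0):
    """Fetch the model listing from an Ollama instance (network call)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def has_model(tags, model):
    """Return True if `model` appears in a parsed /api/tags response."""
    return any(entry.get("name") == model for entry in tags.get("models", []))


# Offline demonstration with a minimal payload in the /api/tags shape.
sample = {"models": [{"name": "qwen2.5vl:32b"}]}
print(has_model(sample, MODEL))

# Live check (requires network access to the instance):
#     has_model(fetch_tags(OLLAMA_URL), MODEL)
```

Separating the network call from the lookup keeps the check easy to test without a running server.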
Data
- Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
- PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
- Output: /Volumes/NV2/PDF-Processing/signature-image-output/
Status
✅ Page extraction: Tested with 100 files, working
✅ Signature extraction: Tested with 5 files, 70% recall, 100% precision
⏳ Large-scale testing: Pending
⏳ Full dataset (86K files): Pending
See PROJECT_DOCUMENTATION.md for complete details.