bc36dcc2b64eb71ae6cac11e0ddf80cc55ce34ea
Phase 1.6 (G2 path) script. Tests whether three INDEPENDENT
statistical approaches converge on the same Big-4 CPA ranking:
1. K=3 GMM cluster posterior P_C1 (hand-leaning)
   -- from full Big-4 K=3 fit (Script 37 baseline).
2. Reverse-anchor directional score
   -- non-Big-4 (n=249, mid/small firms only) as the
      reference Gaussian; -cos_left_tail_pct as score.
   -- Strict separation: no Big-4 CPA in the reference.
3. Paper A v3.x operational rule per-CPA hand_frac
   -- (cos > 0.95 AND dh <= 5) failure rate per CPA.
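The third lens can be illustrated with a minimal sketch. It assumes that "failure" means a signature does not satisfy both conditions of the rule, and that each signature comes with precomputed `cos` and `dh` values; the record layout and the function name `hand_frac_per_cpa` are hypothetical and not taken from the actual script.

```python
from collections import defaultdict

COS_MIN = 0.95  # rule threshold from the commit message: cos > 0.95
DH_MAX = 5      # rule threshold from the commit message: dh <= 5


def hand_frac_per_cpa(records):
    """Return {cpa_id: fraction of signatures that fail the rule}.

    records: iterable of (cpa_id, cos, dh) tuples, one per signature.
    This tuple layout is an assumption made for the sketch.
    """
    total = defaultdict(int)
    failed = defaultdict(int)
    for cpa_id, cos, dh in records:
        total[cpa_id] += 1
        # A signature "fails" when it does not meet both conditions.
        if not (cos > COS_MIN and dh <= DH_MAX):
            failed[cpa_id] += 1
    return {cpa: failed[cpa] / total[cpa] for cpa in total}


# Toy data, not real measurements.
demo = [
    ("cpa_1", 0.99, 2),
    ("cpa_1", 0.90, 12),
    ("cpa_2", 0.97, 4),
    ("cpa_2", 0.98, 5),
]
fracs = hand_frac_per_cpa(demo)
```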
Pairwise Spearman correlations:
p_c1           vs paperA_hand_frac  rho = +0.9627 (p < 1e-248)
reverse_anchor vs paperA_hand_frac  rho = +0.8890 (p < 1e-149)
p_c1           vs reverse_anchor    rho = +0.8794 (p < 1e-142)
Verdict: CONVERGENCE_STRONG (all 3 |rho| >= 0.7).
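The pairwise check and the 0.7 threshold can be sketched as follows. This is an illustration only: Spearman's rho is computed here in pure Python as the Pearson correlation of average ranks, the toy scores are invented, and the `NOT_STRONG` label is a placeholder (the commit message names only `CONVERGENCE_STRONG`).

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def convergence_verdict(scores, threshold=0.7):
    """scores: {metric_name: list of per-CPA scores, same CPA order}."""
    rhos = {
        (a, b): spearman(scores[a], scores[b])
        for a, b in combinations(scores, 2)
    }
    strong = all(abs(r) >= threshold for r in rhos.values())
    # "NOT_STRONG" is a placeholder label invented for this sketch.
    return rhos, ("CONVERGENCE_STRONG" if strong else "NOT_STRONG")


# Toy scores for five hypothetical CPAs.
scores = {
    "p_c1": [0.01, 0.10, 0.30, 0.25, 0.80],
    "reverse_anchor": [-0.97, -0.85, -0.70, -0.75, -0.40],
    "paperA_hand_frac": [0.20, 0.10, 0.75, 0.70, 0.95],
}
rhos, verdict = convergence_verdict(scores)
```

On real data a library routine such as `scipy.stats.spearmanr` would also give the p-values; the hand-rolled version above only keeps the sketch free of third-party imports.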
Per-firm consistency across lenses:
Firm        n     C1%     C3%  E[P_C1]  E[rev]  E[hand]
Firm A    171   0.00%  82.46%    0.007  -0.973    0.193
KPMG      112   8.93%   0.00%    0.141  -0.820    0.696
PwC       102  23.53%   0.98%    0.311  -0.767    0.790
EY         52  11.54%   1.92%    0.241  -0.713    0.761
Same ordering by all three metrics:
Firm A < KPMG < EY ~= PwC on hand-leaning
(PwC ranks above EY on E[P_C1] and E[hand], below it on E[rev]).
Implication for v4.0: methodology paper now has THREE
independent lines of evidence converging on the same population
structure -- a much harder thing for a reviewer to dismiss
than any single lens.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PDF Signature Extraction System
Automated extraction of handwritten Chinese signatures from PDF documents using a hybrid VLM + computer vision approach.
Quick Start
Step 1: Extract Pages from CSV
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py
Step 2: Extract Signatures
python extract_signatures_hybrid.py
Documentation
- PROJECT_DOCUMENTATION.md - Complete project history, all approaches tested, detailed results
- README_page_extraction.md - Page extraction documentation
- README_hybrid_extraction.md - Hybrid signature extraction documentation
Current Performance
Test Dataset: 5 PDF pages
- Signatures expected: 10
- Signatures found: 7
- Precision: 100% (no false positives)
- Recall: 70%
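The two figures follow directly from the counts above. A small sketch of the arithmetic, assuming all 7 detections are true positives (which is what "no false positives" implies); the function name is illustrative:

```python
def precision_recall(found, false_positives, expected):
    """Compute precision and recall from raw detection counts."""
    true_positives = found - false_positives
    precision = true_positives / found if found else 0.0
    recall = true_positives / expected if expected else 0.0
    return precision, recall


# Counts from the 5-page test set reported above.
precision, recall = precision_recall(found=7, false_positives=0, expected=10)
print(f"precision={precision:.0%} recall={recall:.0%}")
```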
Key Features
✅ Hybrid Approach: VLM name extraction + CV detection + VLM verification
✅ Name-Based: Signatures saved as signature_周寶蓮.png
✅ No False Positives (on the 5-page test set): Name-specific verification filters out dates, text, stamps
✅ Duplicate Prevention: Only one signature per person
✅ Handles Both: PDFs with or without a text layer
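The name-based output and the one-signature-per-person rule can be sketched together. This is a hypothetical helper, not code from `extract_signatures_hybrid.py`; only the `signature_<name>.png` pattern comes from the feature list above.

```python
from pathlib import Path


def signature_path(output_dir, name, seen):
    """Return the output path for `name`, or None if already saved.

    `seen` is a set of names handled so far; it is updated in place
    so a second signature for the same person is skipped.
    """
    if name in seen:
        return None
    seen.add(name)
    return Path(output_dir) / f"signature_{name}.png"


seen = set()
first = signature_path("out", "周寶蓮", seen)   # path for the first match
second = signature_path("out", "周寶蓮", seen)  # duplicate, so None
```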
File Structure
extract_pages_from_csv.py # Step 1: Extract pages
extract_signatures_hybrid.py # Step 2: Extract signatures (CURRENT)
README.md # This file
PROJECT_DOCUMENTATION.md # Complete documentation
README_page_extraction.md # Page extraction guide
README_hybrid_extraction.md # Signature extraction guide
Requirements
- Python 3.9+
- PyMuPDF, OpenCV, NumPy, Requests
- Ollama with qwen2.5vl:32b model
- Ollama instance: http://192.168.30.36:11434
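For orientation, a request to the Ollama instance might look like the sketch below. It uses Ollama's `/api/generate` endpoint with a base64-encoded image and relies on the standard library's `urllib` rather than Requests, purely to keep the example dependency-free. The prompt text and both function names are assumptions; the project's actual prompts and call structure are not shown here.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://192.168.30.36:11434"  # instance listed above
MODEL = "qwen2.5vl:32b"                     # model listed above


def build_generate_payload(image_bytes, prompt):
    """Build the JSON body for a non-streaming /api/generate call."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ask_vlm(image_bytes, prompt, timeout=120):
    """Send one image plus prompt to the VLM and return its text reply."""
    body = json.dumps(build_generate_payload(image_bytes, prompt))
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Building the payload needs no network access; the prompt is illustrative.
payload = build_generate_payload(
    b"\x89PNG", "Is this a handwritten signature? Answer yes or no."
)
```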
Data
- Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
- PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
- Output: /Volumes/NV2/PDF-Processing/signature-image-output/
Status
✅ Page extraction: Tested with 100 files, working
✅ Signature extraction: Tested with 5 files, 70% recall, 100% precision
⏳ Large-scale testing: Pending
⏳ Full dataset (86K files): Pending
See PROJECT_DOCUMENTATION.md for complete details.