ef0e417257600cb7617c9520379e535318b41af4
Opus 4.7 max-effort round-12 on v3.12 found 1 MAJOR + 7 MINOR residues;
codex gpt-5.5 xhigh round-13 cross-verified 11/11 RESOLVED and caught
one additional cosine-P95 ambiguity Opus missed (methodology L255).
Total 12 text-only edits across 5 files.
MAJOR M1 - Cosine P95→P7.5 terminology residue at two sites that cite
the v3.12-corrected Section III-L but still wrote "P95" (self-
contradiction). Fix: methodology L165 and results L247 both restated
as "whole-sample Firm A P7.5 heuristic" with the 92.5%/7.5%
complement spelled out.
MINOR findings and fixes:
- m1 Big-4 scope slip: methodology III-H(b) L166 and results IV-H.2
L311 said "every Big-4 auditor-year" but IV-H.2 ranking actually
pools all 4,629 auditor-years across Big-4 and Non-Big-4. Both
sites now say "every auditor-year ... across all firms."
- m2 178 vs 180 Firm A CPA breakdown: intro L54 and conclusion L21
now add "of 180 registered CPAs; 178 after excluding two with
disambiguation ties, Section IV-G.2" parenthetical to avoid the
misleading 180−171=9 reading.
- m3 IV-H.1 A2 citation: results L286 now explicitly invokes the
A2 within-year label-uniformity convention (Section III-G) when
reading the left-tail share as a partner-level "minority of hand-
signers."
- m4 IV-F L177 cross-ref / fold distinction: corrected Section III-H
→ Section III-L anchor, and added explicit note that the 0.95
heuristic is a whole-sample anchor while Table XI thresholds are
calibration-fold-derived (cosine P5 = 0.9407).
- m5 Table XVI (30,222) vs Table XVII (30,226) Firm A count gap:
results L406 now explains the 4-report difference (XVI restricts
to both-signers-Firm-A single-firm two-signer reports; XVII counts
at-least-one-Firm-A signer under the 84,386-document cohort).
- m6 Methodology L156 "four independent quantitative analyses"
actually enumerated 6 items: rephrased as "three primary
independent quantitative analyses plus a fourth strand comprising
three complementary checks."
- m7 Abstract "cluster into three groups" restored the "smoothly-
mixed" qualifier to match Discussion V-B and Conclusion L17.
- Codex-caught residue at methodology L255 ("Median, 1st percentile,
and 95th percentile of signature-level cosine/dHash distributions")
grammatically applied P95 to cosine too. Rewrote as
"cosine median, P1, and P5 (lower-tail) and dHash_indep median
and P95 (upper-tail)" matching Table XI L233 exactly.
No re-computation. All tables (IV-XVIII) and Appendix A numbers
unchanged. Abstract at 249/250 words after smoothly-mixed qualifier.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PDF Signature Extraction System
Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach.
Quick Start
Step 1: Extract Pages from CSV
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py
Step 2: Extract Signatures
python extract_signatures_hybrid.py
Documentation
- PROJECT_DOCUMENTATION.md - Complete project history, all approaches tested, detailed results
- README_page_extraction.md - Page extraction documentation
- README_hybrid_extraction.md - Hybrid signature extraction documentation
Current Performance
Test Dataset: 5 PDF pages
- Signatures expected: 10
- Signatures found: 7
- Precision: 100% (no false positives)
- Recall: 70%
Key Features
✅ Hybrid Approach: VLM name extraction + CV detection + VLM verification
✅ Name-Based: Signatures saved as signature_周寶蓮.png
✅ No False Positives: Name-specific verification filters out dates, text, stamps
✅ Duplicate Prevention: Only one signature per person
✅ Handles Both: PDFs with/without text layer
File Structure
extract_pages_from_csv.py # Step 1: Extract pages
extract_signatures_hybrid.py # Step 2: Extract signatures (CURRENT)
README.md # This file
PROJECT_DOCUMENTATION.md # Complete documentation
README_page_extraction.md # Page extraction guide
README_hybrid_extraction.md # Signature extraction guide
Requirements
- Python 3.9+
- PyMuPDF, OpenCV, NumPy, Requests
- Ollama with qwen2.5vl:32b model
- Ollama instance: http://192.168.30.36:11434
Data
- Input:
/Volumes/NV2/PDF-Processing/master_signatures.csv(86,073 rows) - PDFs:
/Volumes/NV2/PDF-Processing/total-pdf/batch_*/ - Output:
/Volumes/NV2/PDF-Processing/signature-image-output/
Status
✅ Page extraction: Tested with 100 files, working ✅ Signature extraction: Tested with 5 files, 70% recall, 100% precision ⏳ Large-scale testing: Pending ⏳ Full dataset (86K files): Pending
See PROJECT_DOCUMENTATION.md for complete details.
Description
Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach. 70% recall, 100% precision.
Languages
Python
100%