b884d395442939be8b92b0bcbf689fe8e43a2794
Addresses round-1 findings from all three AI reviewers in a single pass. Substantive empirical content unchanged; fixes are factual corrections, terminology consistency, and table-numbering hygiene.

Opus M3 (Abstract-level factual misstatement): "98-100% of inter-CPA collisions within source firm" repeated in Abstract / §I body / §I item 6 / §V-C / §V-G limitation 2 / §VI item 4 / §VI Future Work conflated the same-pair joint rate (97.0-99.96%) with the any-pair deployed rule rate (76.7-98.8% across Firms A/B/C/D: Firm A 98.8, B 76.7, C 83.7, D 77.4 from Table XXV). Replaced with the actual any-pair range and explicit same-pair sub-range. Removed §V-C's "regardless of which Big-4 firm is the source"; within-firm concentration is firm-dependent.

Opus M1 (§IV K=3 mechanism-label reversion): §IV silently regressed to v3.x "C1 hand-leaning / C2 mixed / C3 replicated" naming that §III-J line 90 explicitly retires post-composition-decomposition. Replaced in Tables IX/X/XIV/XVI/XVII column headers and §IV-F / §IV-H / §IV-J / §IV-K prose. New convention matches §III-J:
- C1 (hand-leaning) -> C1 (low-cos / high-dHash)
- C2 (mixed) -> C2 (central)
- C3 (replicated) -> C3 (high-cos / low-dHash)
- "hand-leaning rate" -> "less-replication-dominated rate"
"Replicated class" retained where it refers to byte-identical ground truth (line 143/153: actual byte-level reuse, not K=3 mechanism inference).

Opus M4 (§V duplicate G heading): Phase 4 prose §V had "G. Pixel-Identity..." at line 105 and "G. Limitations" at line 109. Renamed second heading to "H. Limitations".

Opus M2 + Gemini Table XV-B (table-numbering cascade): Renamed Table XV-B to Table XIX, then cascaded XIX -> XX -> ... -> XXV -> XXVI to keep sequential integer numbering. Cross-reference at §IV-J also updated. No cross-refs to these tables exist outside §IV (verified by grep against §III + Phase 4 prose).

Gemini sample-size footnote (Table XV): expanded the source note to explicitly explain the 150,442 (descriptor-complete) vs 150,453 (vector-complete) distinction across §IV sub-sections and point back to §III-G sample-size reconciliation.

§III prose softening (lines 99, 283): "nearly all (98%)" framing that read the Firm A rate as representative of all four Big-4 firms replaced with the per-firm any-pair / same-pair breakdown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PDF Signature Extraction System
Automated extraction of handwritten Chinese signatures from PDF documents using a hybrid VLM + computer vision approach.
Quick Start
Step 1: Extract Pages from CSV
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py
Step 2: Extract Signatures
python extract_signatures_hybrid.py
Documentation
- PROJECT_DOCUMENTATION.md - Complete project history, all approaches tested, detailed results
- README_page_extraction.md - Page extraction documentation
- README_hybrid_extraction.md - Hybrid signature extraction documentation
Current Performance
Test Dataset: 5 PDF pages
- Signatures expected: 10
- Signatures found: 7
- Precision: 100% (no false positives)
- Recall: 70%
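The precision and recall figures above follow directly from the three counts (10 expected, 7 found, 0 false positives). A minimal sketch of that arithmetic, where the function name `precision_recall` is illustrative and not part of the project scripts:

```python
def precision_recall(expected: int, found: int, false_positives: int) -> tuple[float, float]:
    """Return (precision, recall) computed from raw detection counts."""
    true_positives = found - false_positives
    # Precision: share of reported signatures that are real signatures.
    precision = true_positives / found if found else 0.0
    # Recall: share of expected signatures that were actually reported.
    recall = true_positives / expected if expected else 0.0
    return precision, recall


# Counts taken from the 5-page test dataset reported above.
precision, recall = precision_recall(expected=10, found=7, false_positives=0)
print(f"Precision: {precision:.0%}, Recall: {recall:.0%}")
```

With these counts, the three missed signatures lower recall only; precision stays at 100% because nothing that was reported is wrong.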
Key Features
✅ Hybrid Approach: VLM name extraction + CV detection + VLM verification
✅ Name-Based: Signatures saved as signature_周寶蓮.png
✅ No False Positives: Name-specific verification filters out dates, text, stamps
✅ Duplicate Prevention: Only one signature per person
✅ Handles Both: Works on PDFs with or without a text layer
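The name-based naming and duplicate-prevention features can be illustrated with a short sketch. This is not the code in extract_signatures_hybrid.py: the function `plan_signature_files` and the second name in the usage example are hypothetical, and the sketch assumes that the verification step yields a list of signer names in detection order.

```python
from pathlib import Path


def plan_signature_files(verified_names, output_dir):
    """Map verified signer names to output paths, keeping one file per person."""
    seen = set()
    planned = []
    for name in verified_names:
        if name in seen:
            # Duplicate prevention: a later detection for the same person is skipped.
            continue
        seen.add(name)
        # Name-based output: signature_<name>.png, as in signature_周寶蓮.png.
        planned.append(Path(output_dir) / f"signature_{name}.png")
    return planned


# "王小明" is a made-up second signer used only for illustration.
for path in plan_signature_files(["周寶蓮", "周寶蓮", "王小明"], "signature-image-output"):
    print(path.name)
```

Keeping the first detection and discarding later ones is one possible policy; the actual script may choose differently (for example, keeping the highest-confidence crop).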
File Structure
extract_pages_from_csv.py # Step 1: Extract pages
extract_signatures_hybrid.py # Step 2: Extract signatures (CURRENT)
README.md # This file
PROJECT_DOCUMENTATION.md # Complete documentation
README_page_extraction.md # Page extraction guide
README_hybrid_extraction.md # Signature extraction guide
Requirements
- Python 3.9+
- PyMuPDF, OpenCV, NumPy, Requests
- Ollama with qwen2.5vl:32b model
- Ollama instance: http://192.168.30.36:11434
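For orientation, here is a hedged sketch of how a page image could be packaged for the Ollama instance listed above. It uses Ollama's `/api/generate` endpoint, which accepts base64-encoded images; the helper name `build_generate_request` and the prompt text are assumptions, not taken from the project scripts.

```python
import base64

OLLAMA_URL = "http://192.168.30.36:11434"  # instance listed in Requirements
MODEL = "qwen2.5vl:32b"                    # model listed in Requirements


def build_generate_request(image_bytes: bytes, prompt: str) -> dict:
    """Build the JSON body for a POST to {OLLAMA_URL}/api/generate."""
    return {
        "model": MODEL,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        # Ask for a single JSON response instead of a token stream.
        "stream": False,
    }


# Hypothetical prompt; the real prompts live in the extraction script.
payload = build_generate_request(b"<png bytes here>", "List the names of the signers on this page.")
print(sorted(payload))

# Sending it would look roughly like this (not executed here):
#   import requests
#   reply = requests.post(f"{OLLAMA_URL}/api/generate", json=payload, timeout=300)
#   text = reply.json()["response"]
```

Building the payload separately from sending it keeps the request shape easy to inspect without a running Ollama server.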
Data
- Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
- PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
- Output: /Volumes/NV2/PDF-Processing/signature-image-output/
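A small sketch of how the batch layout above could be enumerated. It assumes the PDFs sit directly inside each `batch_*` directory and use a lowercase `.pdf` extension, neither of which is confirmed here; `list_batch_pdfs` is an illustrative name.

```python
from pathlib import Path


def list_batch_pdfs(pdf_root):
    """Return all PDFs found directly under batch_* directories, sorted by path."""
    # Assumption: layout is <pdf_root>/batch_<something>/<file>.pdf
    return sorted(Path(pdf_root).glob("batch_*/*.pdf"))


if __name__ == "__main__":
    pdfs = list_batch_pdfs("/Volumes/NV2/PDF-Processing/total-pdf")
    print(f"{len(pdfs)} PDF files found")
```

If the directory does not exist, the glob simply yields nothing, so the count printed is 0 rather than an error.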
Status
✅ Page extraction: Tested with 100 files, working
✅ Signature extraction: Tested with 5 files, 70% recall, 100% precision
⏳ Large-scale testing: Pending
⏳ Full dataset (86K files): Pending
See PROJECT_DOCUMENTATION.md for complete details.