d3ddf746f4555a68072ec2dacf5a455d6334033d
Closes the substantive net-new findings Opus round-2 surfaced. All
fixes are structural or disclosure improvements; no empirical
content changes.
N1 — Denominator inconsistency disclosure: §IV-M.4 per-firm D2 ICCR
listing (line 325) now explains the $n = 19{,}501$ Firm C
denominator versus §IV-J Table XIX's single-firm-only $19{,}122$.
All 379 mixed-firm PDFs resolve to Firm C under Script 45's
mode-of-firms (majority firm) tie-break. Empirically, Firm C is
the majority firm in every mixed-firm PDF, so the assignment is
not a tie-break artefact. A footnote reconciles both totals
(75,233 vs 74,854).
N2 — §III-M validation table completeness: composition-decomposition
diagnostic (§III-I.4; Scripts 39b–39e) — the foundational v4
evidence cited in Abstract / §I item 4 / §VI item 1 — added as
the first row of the §III-M validation table. Updated:
- §I item 8 (Phase 4 line 57): "nine partial-evidence
diagnostics" → "ten partial-evidence diagnostics (§III-M
Table XXVII)"
- §VI item 8 (Phase 4 line 147): "nine-tool unsupervised-
validation collection (§III-M)" → "ten-tool unsupervised-
validation collection (§III-M Table XXVII)"
- Phase 4 internal draft note still says "nine-tool" but is
internal-strip-at-splice; deliberately not edited.
N3 — Table number assigned: §III-M validation table is now
Table XXVII (continues sequential numbering after §IV-M.6's
Table XXVI). Caption: "Ten-tool unsupervised-validation
collection with disclosed untested assumptions."
N4 — Cross-firm hit matrix assumption row rewritten: replaced the
"None — direct descriptive observation" understatement with the
actual dependency disclosure — same-pair joint event yields
97.0–99.96% within-firm at all four firms versus any-pair
76.7–98.8% — plus the §IV-M.4 mode-of-firms tie-break
cross-reference.
Net result: all three substantive Opus round-2 net-new findings
plus N4 closed. N5 (firm-dependent within-firm violation in §V-H)
and N6 (§IV-I stub cross-reference) deferred as low-priority
optional copy-edits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PDF Signature Extraction System
Automated extraction of handwritten Chinese signatures from PDF documents using a hybrid VLM + computer vision approach.
Quick Start
Step 1: Extract Pages from CSV
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py
Step 2: Extract Signatures
python extract_signatures_hybrid.py
Documentation
- PROJECT_DOCUMENTATION.md - Complete project history, all approaches tested, detailed results
- README_page_extraction.md - Page extraction documentation
- README_hybrid_extraction.md - Hybrid signature extraction documentation
Current Performance
Test Dataset: 5 PDF pages
- Signatures expected: 10
- Signatures found: 7
- Precision: 100% (no false positives)
- Recall: 70%
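The two percentages follow directly from the counts listed here. A minimal sketch of the arithmetic, assuming that all 7 extracted signatures are correct (which is what "no false positives" implies) and that the 3 missed signatures are the false negatives. The function name `precision_recall` is illustrative and not taken from the project code:

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, recall) from raw counts, guarding against empty denominators."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Counts from the 5-page test dataset reported in this README.
expected, found, false_pos = 10, 7, 0
true_pos = found - false_pos        # 7 correct extractions
false_neg = expected - true_pos     # 3 signatures missed

precision, recall = precision_recall(true_pos, false_pos, false_neg)
print(f"precision={precision:.0%} recall={recall:.0%}")  # precision=100% recall=70%
```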
Key Features
✅ Hybrid Approach: VLM name extraction + CV detection + VLM verification
✅ Name-Based: Signatures saved as signature_周寶蓮.png
✅ No False Positives: Name-specific verification filters out dates, text, stamps
✅ Duplicate Prevention: Only one signature per person
✅ Handles Both: PDFs with/without text layer
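The features above suggest a three-stage control flow: names from the VLM, candidate regions from CV, then a per-name VLM check. The sketch below shows only that matching logic, under stated assumptions. `Region`, `extract_signatures`, `output_filename`, and the `is_signature_of` callback are hypothetical names, not the functions in `extract_signatures_hybrid.py`. The verifier is injected as a callable so the example runs without a model or OpenCV.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Region:
    """Candidate bounding box on a page, in pixels (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int


def extract_signatures(
    names: Iterable[str],
    candidates: Iterable[Region],
    is_signature_of: Callable[[Region, str], bool],
) -> dict[str, Region]:
    """Assign each expected name at most one verified candidate region."""
    remaining = list(candidates)
    matched: dict[str, Region] = {}
    for name in names:
        if name in matched:
            continue  # duplicate prevention: one signature per person
        for region in remaining:
            # Name-specific verification; in the real system this would be a VLM call.
            if is_signature_of(region, name):
                matched[name] = region
                remaining.remove(region)  # a region cannot serve two names
                break
    return matched


def output_filename(name: str) -> str:
    """Name-based output file, following the signature_<name>.png pattern."""
    return f"signature_{name}.png"


# Stub verifier standing in for the VLM: only the second region is a signature.
truth = {Region(100, 800, 220, 90): "周寶蓮"}
regions = [Region(50, 40, 300, 30), Region(100, 800, 220, 90)]
result = extract_signatures(["周寶蓮", "周寶蓮"], regions, lambda r, n: truth.get(r) == n)
files = [output_filename(n) for n in result]
print(files)  # ['signature_周寶蓮.png']
```

Candidates that no name claims (dates, printed text, stamps) are simply dropped, which is how a name-specific check can keep precision high at the cost of recall.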
File Structure
extract_pages_from_csv.py # Step 1: Extract pages
extract_signatures_hybrid.py # Step 2: Extract signatures (CURRENT)
README.md # This file
PROJECT_DOCUMENTATION.md # Complete documentation
README_page_extraction.md # Page extraction guide
README_hybrid_extraction.md # Signature extraction guide
Requirements
- Python 3.9+
- PyMuPDF, OpenCV, NumPy, Requests
- Ollama with qwen2.5vl:32b model
- Ollama instance: http://192.168.30.36:11434
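For orientation, a request to the Ollama instance listed above could be built roughly as follows. This is a sketch and not the project's code: it uses the standard library instead of Requests so it is self-contained, the helper names `build_generate_request` and `ask_vlm` are hypothetical, and the prompt is a placeholder. The `/api/generate` endpoint with `model`, `prompt`, `images` (base64 strings), and `stream` fields is Ollama's documented REST interface.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://192.168.30.36:11434"  # instance from the Requirements list
MODEL = "qwen2.5vl:32b"                    # model from the Requirements list


def build_generate_request(image_bytes: bytes, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming /api/generate request with one image."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for a single JSON object instead of a stream
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_vlm(image_bytes: bytes, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the model's text answer (requires network access)."""
    request = build_generate_request(image_bytes, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Build a request from placeholder bytes; nothing is sent here.
req = build_generate_request(b"\x89PNG placeholder", "Placeholder prompt")
print(req.full_url)
```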
Data
- Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
- PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
- Output: /Volumes/NV2/PDF-Processing/signature-image-output/
Status
✅ Page extraction: Tested with 100 files, working
✅ Signature extraction: Tested with 5 files, 70% recall, 100% precision
⏳ Large-scale testing: Pending
⏳ Full dataset (86K files): Pending
See PROJECT_DOCUMENTATION.md for complete details.