ce3315623822c8cda1be5f19c035bd1715ae9a39
Codex round 23 returned Major Revision on §IV v2: 6 Major + 6
Minor + 5 Editorial findings. Codex confirmed the spike-script
provenance is mostly sound -- no scripts needed rerunning -- so
v3 applies presentation-level fixes only.
Decisions baked in:
- Anonymisation: maintain Firm A-D pseudonyms throughout the
manuscript body; remove (Deloitte) / (KPMG) / (PwC) / (EY)
parentheticals from all v4 §IV tables.
- Table numbering: v4 tables use fresh V-XVIII (plus Table XV-B);
inherited v3.x tables are cited only as "v3.20.0 Table N" with
the original v3 number, NOT renumbered into the v4 sequence.
§IV v3 changes:
1. Detection denominator rewritten: 86,072 VLM-positive / 12
corrupted / 86,071 YOLO-processed / 85,042 with-detections /
182,328 signatures (matches v3.x §IV-B exact wording).
2. All v4 table labels stripped of "(revised:" / "(NEW:"
prefixes; replaced with clean "Table N. <descriptor>." form.
3. Real firm names removed from all tables: 4 replace_all edits.
4. Line 211 MC-ordering claim removed: MC occupancy is no longer
described as "consistent with the §III-K Spearman convergence"
because MC fraction is not monotone in per-CPA hand-leaning
ranking. New language: descriptive only, with Firm D / Firm B
ordering counterexample stated.
5. Line 184 81.70% vs 82.46% qualified as "qualitative
alignment, not like-for-like consistency check" (different
units: per-signature class vs per-CPA hard cluster).
6. Line 43 BD-transition "histogram-resolution artefacts"
softened to "scope-dependent and not used operationally";
no specific bin-width artefact claim without sensitivity
sweep evidence.
7. K=3 LOOO C1 weight drift corrected: 0.025 -> 0.023 (matches
Script 37 max deviation 0.0235 / rounded 0.023).
8. Seed coverage in §IV-A updated: "Scripts 32-42" (was
"Scripts 32-41", missed Script 42).
9. Low-cosine cutoff inclusivity: cos < 0.837 -> cos <= 0.837
(matches Script 42 rule definition).
10. "round-22 Light scope" process note removed from
manuscript prose in §IV-K.
11. §IV-L ablation pointer corrected: v3.20.0 §IV-I (was
§IV-H.3); v3.20.0 Table XVIII clarified as different from
v4 Table XVIII.
12. Line 75 "Component recovery verified across Scripts 35,
37, 38" rewritten: "the full-fit baseline is reproduced
in Scripts 35, 37, 38" with explicit note that Script 37
LOOO fold-specific components differ by design.
13. Line 110 grammar: "This convergent-checks evidence" ->
"These convergence checks".
14. Draft note marked "internal -- remove before submission".
§III v4 changes (cross-reference cleanup):
1. Line 13 cross-reference repaired: "§IV-D, §IV-F, §IV-G"
(which are now accountant-level v4 analyses) replaced with
accurate signature-level references (§IV-J for five-way
counts; §IV-I for inherited inter-CPA FAR).
2. Line 23 cross-reference repaired: "all §IV results except
§IV-K" replaced with explicit list of v4-new vs inherited
sub-sections.
3. Line 109 cross-reference repaired: moderate-band capture-
rate evidence cited as "v3.20.0 Tables IX, XI, XII, XII-B"
(was "§IV-F", which is now Convergent Internal-Consistency
Checks, not capture-rate).
4. Line 131 "without recalibration" claim narrowed: §III-K's
convergent-checks evidence is now scoped to the binary
high-confidence rule only; the moderate-confidence band,
style-consistency band, and document-level aggregation
are retained by reference to v3.20.0 calibration, not
claimed as v4.0-validated.
Outstanding open questions: 3 procedural items remain (§IV
table numbering finalisation, §IV-A-C content audit, Phase 4
prose); no methodology blockers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PDF Signature Extraction System
Automated extraction of handwritten Chinese signatures from PDF documents using a hybrid VLM + computer vision approach.
Quick Start
Step 1: Extract Pages from CSV
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py
Step 2: Extract Signatures
python extract_signatures_hybrid.py
Documentation
- PROJECT_DOCUMENTATION.md - Complete project history, all approaches tested, detailed results
- README_page_extraction.md - Page extraction documentation
- README_hybrid_extraction.md - Hybrid signature extraction documentation
Current Performance
Test Dataset: 5 PDF pages
- Signatures expected: 10
- Signatures found: 7
- Precision: 100% (no false positives)
- Recall: 70%
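The two percentages follow directly from the counts above. A minimal sketch of the arithmetic, assuming all 7 found signatures are true positives (which is what 100% precision implies); the function name `precision_recall` is illustrative and not part of the repository:

```python
def precision_recall(true_positives, false_positives, expected):
    """Return (precision, recall) as fractions in [0, 1]."""
    found = true_positives + false_positives
    # Guard against division by zero when nothing was found or expected.
    precision = true_positives / found if found else 0.0
    recall = true_positives / expected if expected else 0.0
    return precision, recall


# Counts from the 5-page test set: 10 expected, 7 found, no false positives.
precision, recall = precision_recall(true_positives=7, false_positives=0, expected=10)
print(f"Precision: {precision:.0%}, Recall: {recall:.0%}")
```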
Key Features
✅ Hybrid Approach: VLM name extraction + CV detection + VLM verification
✅ Name-Based: Signatures saved as signature_周寶蓮.png
✅ No False Positives: Name-specific verification filters out dates, text, stamps
✅ Duplicate Prevention: Only one signature per person
✅ Handles Both: PDFs with and without a text layer
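The name-based output and duplicate-prevention features can be sketched as follows. This is an illustration only, under the assumption that duplicates are tracked by the extracted name; `signature_path` and its parameters are hypothetical and may not match how extract_signatures_hybrid.py is actually written:

```python
from pathlib import Path


def signature_path(output_dir, name, seen):
    """Return the output path for `name`, or None if that person already has one."""
    if name in seen:
        return None  # duplicate prevention: only one signature per person
    seen.add(name)
    # Name-based file naming, e.g. signature_周寶蓮.png
    return Path(output_dir) / f"signature_{name}.png"


seen_names = set()
first = signature_path("out", "周寶蓮", seen_names)   # path ending in signature_周寶蓮.png
second = signature_path("out", "周寶蓮", seen_names)  # None, the name was already saved
```

Tracking names in a set keeps the check constant-time per signature, which matters more at the 86K-file scale than on the 5-page test set.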
File Structure
extract_pages_from_csv.py # Step 1: Extract pages
extract_signatures_hybrid.py # Step 2: Extract signatures (CURRENT)
README.md # This file
PROJECT_DOCUMENTATION.md # Complete documentation
README_page_extraction.md # Page extraction guide
README_hybrid_extraction.md # Signature extraction guide
Requirements
- Python 3.9+
- PyMuPDF, OpenCV, NumPy, Requests
- Ollama with qwen2.5vl:32b model
- Ollama instance: http://192.168.30.36:11434
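Before a long run it can help to confirm that the Ollama instance actually has the model loaded. The sketch below is one possible pre-flight check, not something the repository ships: the helper names are hypothetical, and it assumes the response shape of Ollama's `/api/tags` model-listing endpoint (a `models` array whose entries carry a `name` field).

```python
import json
import urllib.request


def fetch_tags(base_url, timeout=10):
    """GET <base_url>/api/tags, Ollama's endpoint for listing local models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def model_available(tags_payload, model):
    """True if `model` appears in a parsed /api/tags response."""
    return any(m.get("name") == model for m in tags_payload.get("models", []))


# Offline illustration with a hand-written payload in the assumed /api/tags shape.
sample = {"models": [{"name": "qwen2.5vl:32b"}]}
print(model_available(sample, "qwen2.5vl:32b"))

# Against the live instance listed above (requires network access):
# model_available(fetch_tags("http://192.168.30.36:11434"), "qwen2.5vl:32b")
```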
Data
- Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
- PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
- Output: /Volumes/NV2/PDF-Processing/signature-image-output/
Status
✅ Page extraction: Tested with 100 files, working
✅ Signature extraction: Tested with 5 files, 70% recall, 100% precision
⏳ Large-scale testing: Pending
⏳ Full dataset (86K files): Pending
See PROJECT_DOCUMENTATION.md for complete details.