8dddc3b87cf58a8a5832b908d51cd9791ddba96b
Closes the four audit-surfaced concerns from paper/narrative_audit_v4.md plus the Opus round-2 N5 interpretive caveat. All five are prose-level consistency polishings; no empirical or structural changes.

Concern A (Phase 4 line 31 / §I body): "Script 39c" provenance for the jittered-dHash claim was less precise than the §III line 59 source-of-truth which (post round-5) attributes the non-Big-4 jittered evidence to a codex-verified read-only spike. Updated §I to: "cosine: Script 39c; jittered-dHash: Script 39d for Big-4 plus codex-verified read-only spike for ten non-Big-4 firms."

Concern B (Phase 4 line 81 / §V-B): same jittered-dHash claim without precise provenance. Updated §V-B to match Concern A attribution + §III-I.4 cross-reference.

Concern C (§III-K.4 line 149): cross-reference to "v3.x §IV-I corpus-wide version" was stale after v4 §IV-I was shrunk to a reframing stub. Updated to "§III-L.1 (Big-4 v4 sample) and the inherited corpus-wide v3.x version cited at §IV-I".

Concern D (Spearman precision): standardized §III-K.1 table at lines 125-127 to 4 decimal places (0.963/0.889/0.879 -> 0.9627/0.8890/0.8794), matching §IV-F Table IX. Prose floor language "rho >= 0.879" preserved across Abstract/§I/§V/§VI since 0.8794 still rounds to 0.879 at 3dp.

Opus N5 / §V-H limit 2 nuance: added a sentence interpreting the firm-dependent within-firm violation - Firm A's per-firm ICCR is more contaminated by within-firm sharing than B/C/D's, so the B/C/D rates of 0.09-0.16 are closer to clean specificity, and the Firm A vs B/C/D contrast reflects both genuine heterogeneity AND a firm-dependent proxy-contamination gradient.

Audit artifact paper/narrative_audit_v4.md (~200 lines) captures the full cross-section coherence check across Abstract / §I / §III / §IV / §V / §VI:
- Abstract -> body mirror audit (12 claims, all aligned)
- §I 8 contributions -> §III/§IV/§V/§VI mapping (all aligned)
- v3->v4 pivot rhetoric thread (5 nodes, all aligned)
- K=3 demotion / ICCR-FAR / numbers consistency: all verified
- Splice-readiness gate: 10/12 pass + 2 splice-time mechanical

Headline assessment: "Mostly Coherent - submission-ready after 2-3 small patches" (now applied).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PDF Signature Extraction System
Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach.
Quick Start
Step 1: Extract Pages from CSV
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py
Step 2: Extract Signatures
python extract_signatures_hybrid.py
Documentation
- PROJECT_DOCUMENTATION.md - Complete project history, all approaches tested, detailed results
- README_page_extraction.md - Page extraction documentation
- README_hybrid_extraction.md - Hybrid signature extraction documentation
Current Performance
Test Dataset: 5 PDF pages
- Signatures expected: 10
- Signatures found: 7
- Precision: 100% (no false positives)
- Recall: 70%
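The precision and recall figures follow from the standard definitions. The sketch below reproduces the arithmetic, assuming all 7 found signatures are true positives (which is what "no false positives" implies); the function name `precision_recall` is illustrative and not part of the project.

```python
def precision_recall(true_pos: int, false_pos: int, expected: int) -> tuple[float, float]:
    """Return (precision, recall) for one detection run."""
    found = true_pos + false_pos
    # Precision: share of reported detections that are correct.
    precision = true_pos / found if found else 0.0
    # Recall: share of expected signatures that were detected.
    recall = true_pos / expected if expected else 0.0
    return precision, recall


# Numbers from the 5-page test set: 10 expected, 7 found, 0 false positives.
precision, recall = precision_recall(true_pos=7, false_pos=0, expected=10)
print(f"precision={precision:.0%} recall={recall:.0%}")
# prints: precision=100% recall=70%
```

With only 10 expected signatures, each missed signature moves recall by 10 percentage points, so these figures are best read as a smoke test, not a stable estimate.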
Key Features
✅ Hybrid Approach: VLM name extraction + CV detection + VLM verification
✅ Name-Based: Signatures saved as signature_周寶蓮.png
✅ No False Positives: Name-specific verification filters out dates, text, stamps
✅ Duplicate Prevention: Only one signature per person
✅ Handles Both: PDFs with/without text layer
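The name-based output and duplicate-prevention features can be sketched roughly as follows. This is a minimal illustration, not the code in extract_signatures_hybrid.py: the helper `plan_outputs`, the `(name, region)` tuple shape, and the rule "keep the first verified region per name" are all assumptions made for the example.

```python
from pathlib import Path


def plan_outputs(verified, output_dir):
    """Map each signer name to one output path.

    `verified` is an iterable of (name, region) pairs that already passed
    name-specific verification. Assumption: the first region seen for a
    name wins; later regions for the same name are skipped.
    """
    planned = {}
    for name, region in verified:
        if name in planned:
            continue  # duplicate prevention: one signature per person
        planned[name] = (Path(output_dir) / f"signature_{name}.png", region)
    return planned


# Hypothetical regions as (x, y, width, height); values are placeholders.
candidates = [
    ("周寶蓮", (120, 640, 180, 60)),
    ("周寶蓮", (125, 645, 170, 55)),  # second hit for the same person
    ("placeholder_name", (400, 640, 180, 60)),
]
plan = plan_outputs(candidates, "signature-image-output")
for name, (path, region) in plan.items():
    print(name, path.name, region)
```

Keying on the extracted name means a region is only saved when it can be tied to a specific person, which matches the name-specific verification described above.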
File Structure
extract_pages_from_csv.py # Step 1: Extract pages
extract_signatures_hybrid.py # Step 2: Extract signatures (CURRENT)
README.md # This file
PROJECT_DOCUMENTATION.md # Complete documentation
README_page_extraction.md # Page extraction guide
README_hybrid_extraction.md # Signature extraction guide
Requirements
- Python 3.9+
- PyMuPDF, OpenCV, NumPy, Requests
- Ollama with qwen2.5vl:32b model
- Ollama instance: http://192.168.30.36:11434
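A request to the Ollama instance listed above might look like the sketch below. It uses Ollama's `/api/generate` endpoint with a base64-encoded image and relies on the standard library's `urllib` (in place of Requests) so the example stays self-contained. The helper names, the prompt text, and the timeout are assumptions for illustration and are not taken from the project scripts.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://192.168.30.36:11434"  # instance from the Requirements list
MODEL = "qwen2.5vl:32b"


def build_generate_request(image_bytes: bytes, prompt: str) -> dict:
    """Build the JSON body for a non-streaming /api/generate call."""
    return {
        "model": MODEL,
        "prompt": prompt,
        # Ollama expects images as a list of base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ask_vlm(image_bytes: bytes, prompt: str, timeout: float = 120.0) -> str:
    """Send one image and prompt to the VLM and return its text reply."""
    body = json.dumps(build_generate_request(image_bytes, prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Building the payload needs no network access; the prompt is a placeholder.
payload = build_generate_request(b"abc", "List the names of the signers on this page.")
print(payload["model"], payload["images"])
```

Calling `ask_vlm` requires the Ollama host to be reachable and the model to be pulled; a 32B vision model can be slow per page, so the timeout may need tuning.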
Data
- Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
- PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
- Output: /Volumes/NV2/PDF-Processing/signature-image-output/
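Because the PDFs are spread across `batch_*` directories, a lookup by filename is convenient before processing CSV rows. The sketch below is one possible way to build such an index; it assumes the PDFs sit directly inside each batch directory with a `.pdf` extension, and `index_pdfs` is a hypothetical helper, not a function in the project scripts.

```python
from pathlib import Path


def index_pdfs(pdf_root) -> dict[str, Path]:
    """Map PDF filename -> path for every PDF found under batch_* directories."""
    index: dict[str, Path] = {}
    for pdf in sorted(Path(pdf_root).glob("batch_*/*.pdf")):
        # If the same filename appears in two batches, keep the first one.
        index.setdefault(pdf.name, pdf)
    return index


if __name__ == "__main__":
    pdfs = index_pdfs("/Volumes/NV2/PDF-Processing/total-pdf")
    print(f"{len(pdfs)} PDFs indexed")
```

How rows in master_signatures.csv reference a PDF is not stated here, so joining the index against the CSV is left out.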
Status
✅ Page extraction: Tested with 100 files, working
✅ Signature extraction: Tested with 5 files, 70% recall, 100% precision
⏳ Large-scale testing: Pending
⏳ Full dataset (86K files): Pending
See PROJECT_DOCUMENTATION.md for complete details.