5717d61dd459da99bd25249c03cd380c0a1e2cc9
Codex (gpt-5.4) second-round review recommended 'minor revision'. This commit addresses all issues flagged in that review. ## Structural fixes - dHash calibration inconsistency (codex #1, most important): Clarified in Section III-L that the <=5 and <=15 dHash cutoffs come from the whole-sample Firm A cosine-conditional dHash distribution (median=5, P95=15), not from the calibration-fold independent-minimum dHash distribution (median=2, P95=9) which we report elsewhere as descriptive anchors. Added explicit note about the two dHash conventions and their relationship. - Section IV-H framing (codex #2): Renamed "Firm A Benchmark Validation: Threshold-Independent Evidence" to "Additional Firm A Benchmark Validation" and clarified in the section intro that H.1 uses a fixed 0.95 cutoff, H.2 is fully threshold-free, H.3 uses the calibrated classifier. H.3's concluding sentence now says "the substantive evidence lies in the cross-firm gap" rather than claiming the test is threshold-free. - Table XVI 93,979 typo fixed (codex #3): Corrected to 84,354 total (83,970 same-firm + 384 mixed-firm). - Held-out Firm A denominator 124+54=178 vs 180 (codex #4): Added explicit note that 2 CPAs were excluded due to disambiguation ties in the CPA registry. - Table VIII duplication (codex #5): Removed the duplicate accountant-level-only Table VIII comment; the comprehensive cross-level Table VIII subsumes it. Text now says "accountant-level rows of Table VIII (below)". - Anonymization broken in Tables XIV-XVI (codex #6): Replaced "Deloitte"/"KPMG"/"PwC"/"EY" with "Firm A"/"Firm B"/"Firm C"/ "Firm D" across Tables XIV, XV, XVI. Table and caption language updated accordingly. - Table X unit mismatch (codex #7): Dropped precision, recall, F1 columns. Table now reports FAR (against the inter-CPA negative anchor) with Wilson 95% CIs and FRR (against the byte-identical positive anchor). III-K and IV-G.1 text updated to justify the change. ## Sentence-level fixes - "three independent statistical methods" in Methodology III-A -> "three methodologically distinct statistical methods". - "three independent methods" in Conclusion -> "three methodologically distinct methods". - Abstract "~0.006 converging" now explicitly acknowledges that BD/McCrary produces no significant accountant-level discontinuity. - Conclusion ditto. - Discussion limitation sentence "BD/McCrary should be interpreted at the accountant level for threshold-setting purposes" rewritten to reflect v3.3 result that BD/McCrary is a diagnostic, not a threshold estimator, at the accountant level. - III-H "two analyses" -> "three analyses" (H.1 longitudinal stability, H.2 partner ranking, H.3 intra-report consistency). - Related Work White 1982 overclaim rewritten: "consistent estimators of the pseudo-true parameter that minimizes KL divergence" replaces "guarantees asymptotic recovery". - III-J "behavior is close to discrete" -> "practice is clustered". - IV-D.2 pivot sentence "discreteness of individual behavior yields bimodality" -> "aggregation over signatures reveals clustered (though not sharply discrete) patterns". Target journal remains IEEE Access. Output: Paper_A_IEEE_Access_Draft_v3.docx (395 KB). Codex v3.2 review saved to paper/codex_review_gpt54_v3_2.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PDF Signature Extraction System
Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach.
Quick Start
Step 1: Extract Pages from CSV
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py
Step 2: Extract Signatures
python extract_signatures_hybrid.py
Documentation
- PROJECT_DOCUMENTATION.md - Complete project history, all approaches tested, detailed results
- README_page_extraction.md - Page extraction documentation
- README_hybrid_extraction.md - Hybrid signature extraction documentation
Current Performance
Test Dataset: 5 PDF pages
- Signatures expected: 10
- Signatures found: 7
- Precision: 100% (no false positives)
- Recall: 70%
Key Features
✅ Hybrid Approach: VLM name extraction + CV detection + VLM verification
✅ Name-Based: Signatures saved as signature_周寶蓮.png
✅ No False Positives: Name-specific verification filters out dates, text, stamps
✅ Duplicate Prevention: Only one signature per person
✅ Handles Both: PDFs with/without text layer
File Structure
extract_pages_from_csv.py # Step 1: Extract pages
extract_signatures_hybrid.py # Step 2: Extract signatures (CURRENT)
README.md # This file
PROJECT_DOCUMENTATION.md # Complete documentation
README_page_extraction.md # Page extraction guide
README_hybrid_extraction.md # Signature extraction guide
Requirements
- Python 3.9+
- PyMuPDF, OpenCV, NumPy, Requests
- Ollama with qwen2.5vl:32b model
- Ollama instance: http://192.168.30.36:11434
Data
- Input:
/Volumes/NV2/PDF-Processing/master_signatures.csv(86,073 rows) - PDFs:
/Volumes/NV2/PDF-Processing/total-pdf/batch_*/ - Output:
/Volumes/NV2/PDF-Processing/signature-image-output/
Status
✅ Page extraction: Tested with 100 files, working ✅ Signature extraction: Tested with 5 files, 70% recall, 100% precision ⏳ Large-scale testing: Pending ⏳ Full dataset (86K files): Pending
See PROJECT_DOCUMENTATION.md for complete details.
Description
Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach. 70% recall, 100% precision.
Languages
Python
100%