fcce58aff0af9eea88d2e3f7fe750a22ffd7c4e0
Gemini round-6 (paper/gemini_review_v3_7.md) gave Minor Revision but
flagged three issues that five rounds of codex review had missed.
This commit addresses all three.
BLOCKER: Accountant-level BD/McCrary null is a power artifact, not
proof of smoothness (Gemini Issue 1)
- At N=686 accountants the BD/McCrary test has limited statistical
power; reading a failure-to-reject as affirmative proof of
smoothness ignores the risk of a Type II error.
- Discussion V-B: "itself diagnostic of smoothness" replaced with
"failure-to-reject rather than a failure of the method ---
informative alongside the other evidence but subject to the power
caveat in Section V-G".
- Discussion V-G (Sixth limitation): added a power-aware paragraph
naming N=686 explicitly and clarifying that the substantive claim
of smoothly-mixed clustering rests on the JOINT weight of dip
test + BIC-selected GMM + BD null, not on BD alone.
- Results IV-D.1 and IV-E: reframed the accountant-level null as
"consistent with --- not affirmative proof of"
clustered-but-smoothly-mixed, citing V-G for the power caveat.
- Appendix A interpretation paragraph: explicit inferential-asymmetry
sentence ("consistency is what the BD null delivers, not
affirmative proof"); "itself evidence for" removed.
- Conclusion: "consistent with clustered but smoothly mixed"
rephrased with explicit power caveat ("at N = 686 the test has
limited power and cannot affirmatively establish smoothness").
MAJOR: Table X FRR / EER was tautological reviewer-bait
(Gemini Issue 2)
- The byte-identical positive anchor has a cosine similarity of
approximately 1 by construction, so FRR against that subset is
trivially 0 at every threshold below 1, and any EER calculation is
an arithmetic tautology, not a measure of biometric performance.
- Results IV-G.1: removed EER row; dropped FRR column from Table X;
added a table note explaining the omission and directing readers
to Section V-F for the conservative-subset discussion.
- Methodology III-K: removed the EER / FRR-against-byte-identical
reporting clause; clarified that FAR against inter-CPA negatives
is the primary reported quantity.
- Table X is now FAR + Wilson 95% CI only, which is the quantity
that actually carries empirical content on this anchor design.
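For reference, the FAR-plus-Wilson reporting described above can be sketched as follows. This is a minimal illustration, not the paper's analysis code: the function names, the `>=` acceptance rule, and the example scores are assumptions made here, and only the Wilson score formula itself is standard.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def far_counts(negative_scores, threshold):
    """Count inter-CPA negative pairs accepted at a threshold.

    Assumption: a pair is "accepted" when its similarity is >= threshold.
    Returns (false_accepts, total_negatives).
    """
    scores = list(negative_scores)
    false_accepts = sum(1 for s in scores if s >= threshold)
    return false_accepts, len(scores)

# Hypothetical similarity scores for inter-CPA negative pairs.
example_scores = [0.10, 0.50, 0.96, 0.99]
fa, n = far_counts(example_scores, threshold=0.95)
low, high = wilson_interval(fa, n)
```

The Wilson interval stays inside [0, 1] and remains informative when the observed FAR is exactly 0, which is why it suits a table where false accepts are rare.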
MINOR: Document-level worst-case aggregation narrative (Gemini
Issue 3) + 15-signature delta (Gemini spot-check)
- Results IV-I: added two sentences explicitly noting that the
document-level percentages reflect the Section III-L worst-case
aggregation rule (a report with one stamped + one hand-signed
signature inherits the most-replication-consistent label), and
cross-referencing Section IV-H.3 / Table XVI for the mixed-report
composition that qualifies the headline percentages.
- Results IV-D: added a one-sentence footnote explaining that the
15-signature delta between the Table III CPA-matched count
(168,755) and the all-pairs analyzed count (168,740) is due to
CPAs with exactly one signature, for whom no same-CPA pairwise
best-match statistic exists.
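The worst-case aggregation rule mentioned above can be illustrated with a short sketch. The label names and their ordering below are hypothetical placeholders; the actual categories are those defined in Section III-L of the paper.

```python
# Hypothetical label names, ordered from least to most replication-consistent.
# The real categories and their order come from Section III-L, not from this sketch.
LABEL_ORDER = ["hand_signed", "uncertain", "replicated"]
RANK = {label: i for i, label in enumerate(LABEL_ORDER)}

def document_label(signature_labels):
    """Return the most-replication-consistent label among a report's signatures."""
    labels = list(signature_labels)
    if not labels:
        raise ValueError("a report needs at least one signature label")
    return max(labels, key=RANK.__getitem__)

def is_mixed(signature_labels):
    """True when a report's signatures do not all share the same label."""
    return len(set(signature_labels)) > 1

# A report with one stamped and one hand-signed signature inherits the
# stamped (most-replication-consistent) label, and is flagged as mixed.
report = ["hand_signed", "replicated"]
label = document_label(report)
mixed = is_mixed(report)
```

Tracking `is_mixed` alongside the aggregated label is what lets the mixed-report composition qualify the document-level percentages.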
Abstract remains 243 words, comfortably under the IEEE Access
250-word cap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PDF Signature Extraction System
Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach.
Quick Start
Step 1: Extract Pages from CSV
cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py
Step 2: Extract Signatures
python extract_signatures_hybrid.py
Documentation
- PROJECT_DOCUMENTATION.md - Complete project history, all approaches tested, detailed results
- README_page_extraction.md - Page extraction documentation
- README_hybrid_extraction.md - Hybrid signature extraction documentation
Current Performance
Test Dataset: 5 PDF pages
- Signatures expected: 10
- Signatures found: 7
- Precision: 100% (no false positives)
- Recall: 70%
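The figures above follow directly from the counts. A minimal sketch of the arithmetic, where the function name is an illustrative choice rather than part of the project:

```python
def precision_recall(true_positives: int, false_positives: int, expected: int):
    """Precision = TP / (TP + FP); recall = TP / expected signatures."""
    found = true_positives + false_positives
    precision = true_positives / found if found else 0.0
    recall = true_positives / expected if expected else 0.0
    return precision, recall

# 7 signatures found, none of them false positives, 10 expected.
precision, recall = precision_recall(true_positives=7, false_positives=0, expected=10)
print(f"precision={precision:.0%} recall={recall:.0%}")  # precision=100% recall=70%
```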
Key Features
✅ Hybrid Approach: VLM name extraction + CV detection + VLM verification
✅ Name-Based: Signatures saved as signature_周寶蓮.png
✅ No False Positives: Name-specific verification filters out dates, text, stamps
✅ Duplicate Prevention: Only one signature per person
✅ Handles Both: PDFs with/without text layer
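The name-based saving and duplicate-prevention features can be pictured with the sketch below. It is an assumption about how such logic might look, not the code in extract_signatures_hybrid.py: the function names, the `(name, crop_id)` input shape, and the keep-the-first-detection policy are all hypothetical.

```python
from pathlib import Path

def signature_filename(name: str) -> str:
    """Build the output file name, e.g. signature_周寶蓮.png."""
    return f"signature_{name}.png"

def plan_outputs(detections, output_dir):
    """Keep one verified signature per person.

    detections: iterable of (name, crop_id) pairs in detection order.
    Assumption: the first verified crop for each name wins.
    Returns a list of (crop_id, output_path) pairs.
    """
    seen = set()
    planned = []
    for name, crop_id in detections:
        if name in seen:
            continue  # duplicate prevention: this person is already covered
        seen.add(name)
        planned.append((crop_id, Path(output_dir) / signature_filename(name)))
    return planned

# "name_b" is a placeholder for a second signer.
plan = plan_outputs([("周寶蓮", 0), ("周寶蓮", 1), ("name_b", 2)], "out")
```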
File Structure
extract_pages_from_csv.py # Step 1: Extract pages
extract_signatures_hybrid.py # Step 2: Extract signatures (CURRENT)
README.md # This file
PROJECT_DOCUMENTATION.md # Complete documentation
README_page_extraction.md # Page extraction guide
README_hybrid_extraction.md # Signature extraction guide
Requirements
- Python 3.9+
- PyMuPDF, OpenCV, NumPy, Requests
- Ollama with qwen2.5vl:32b model
- Ollama instance: http://192.168.30.36:11434
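As a rough sketch of how a script might query the Ollama instance listed above, the following builds a request for Ollama's `/api/generate` endpoint with one image attached. The helper names and the prompt are hypothetical, and the standard library is used here for self-containment even though the project lists Requests; the project's own scripts may call the API differently.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://192.168.30.36:11434"  # instance listed in Requirements
MODEL = "qwen2.5vl:32b"

def build_generate_payload(prompt: str, image_bytes: bytes, model: str = MODEL) -> dict:
    """Build a non-streaming /api/generate payload with one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

def ask_vlm(prompt: str, image_bytes: bytes, base_url: str = OLLAMA_URL, timeout: int = 120) -> str:
    """Send the payload and return the model's text response (needs network access)."""
    data = json.dumps(build_generate_payload(prompt, image_bytes)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]

# Building the payload works offline; ask_vlm() requires the Ollama host to be reachable.
payload = build_generate_payload("List the signer names on this page.", b"abc")
```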
Data
- Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
- PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
- Output: /Volumes/NV2/PDF-Processing/signature-image-output/
Status
✅ Page extraction: Tested with 100 files, working
✅ Signature extraction: Tested with 5 files, 70% recall, 100% precision
⏳ Large-scale testing: Pending
⏳ Full dataset (86K files): Pending
See PROJECT_DOCUMENTATION.md for complete details.