gbanyan ef0e417257 Paper A v3.13: resolve Opus 4.7 round-12 + codex gpt-5.5 round-13 findings
Opus 4.7 max-effort round-12 on v3.12 found 1 MAJOR + 7 MINOR residues;
codex gpt-5.5 xhigh round-13 cross-verified 11/11 RESOLVED and caught
one additional cosine-P95 ambiguity Opus missed (methodology L255).
Total 12 text-only edits across 5 files.

MAJOR M1 - Cosine P95→P7.5 terminology residue at two sites that cite
the v3.12-corrected Section III-L but still wrote "P95" (self-
contradiction). Fix: methodology L165 and results L247 both restated
as "whole-sample Firm A P7.5 heuristic" with the 92.5%/7.5%
complement spelled out.

MINOR findings and fixes:
- m1 Big-4 scope slip: methodology III-H(b) L166 and results IV-H.2
  L311 said "every Big-4 auditor-year" but IV-H.2 ranking actually
  pools all 4,629 auditor-years across Big-4 and Non-Big-4. Both
  sites now say "every auditor-year ... across all firms."
- m2 178 vs 180 Firm A CPA breakdown: intro L54 and conclusion L21
  now add "of 180 registered CPAs; 178 after excluding two with
  disambiguation ties, Section IV-G.2" parenthetical to avoid the
  misleading 180−171=9 reading.
- m3 IV-H.1 A2 citation: results L286 now explicitly invokes the
  A2 within-year label-uniformity convention (Section III-G) when
  reading the left-tail share as a partner-level "minority of hand-
  signers."
- m4 IV-F L177 cross-ref / fold distinction: corrected Section III-H
  → Section III-L anchor, and added explicit note that the 0.95
  heuristic is a whole-sample anchor while Table XI thresholds are
  calibration-fold-derived (cosine P5 = 0.9407).
- m5 Table XVI (30,222) vs Table XVII (30,226) Firm A count gap:
  results L406 now explains the 4-report difference (XVI restricts
  to both-signers-Firm-A single-firm two-signer reports; XVII counts
  at-least-one-Firm-A signer under the 84,386-document cohort).
- m6 Methodology L156 "four independent quantitative analyses"
  actually enumerated 6 items: rephrased as "three primary
  independent quantitative analyses plus a fourth strand comprising
  three complementary checks."
- m7 Abstract "cluster into three groups" restored the "smoothly-
  mixed" qualifier to match Discussion V-B and Conclusion L17.
- Codex-caught residue at methodology L255 ("Median, 1st percentile,
  and 95th percentile of signature-level cosine/dHash distributions")
  grammatically applied P95 to cosine too. Rewrote as
  "cosine median, P1, and P5 (lower-tail) and dHash_indep median
  and P95 (upper-tail)" matching Table XI L233 exactly.

No re-computation. All tables (IV-XVIII) and Appendix A numbers
unchanged. Abstract at 249/250 words after smoothly-mixed qualifier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 21:21:37 +08:00

PDF Signature Extraction System

Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach.

Quick Start

Step 1: Extract Pages from CSV

cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py

Step 2: Extract Signatures

python extract_signatures_hybrid.py

Documentation

Current Performance

Test Dataset: 5 PDF pages

  • Signatures expected: 10
  • Signatures found: 7
  • Precision: 100% (no false positives)
  • Recall: 70%

Key Features

Hybrid Approach: VLM name extraction + CV detection + VLM verification Name-Based: Signatures saved as signature_周寶蓮.png No False Positives: Name-specific verification filters out dates, text, stamps Duplicate Prevention: Only one signature per person Handles Both: PDFs with/without text layer

File Structure

extract_pages_from_csv.py          # Step 1: Extract pages
extract_signatures_hybrid.py       # Step 2: Extract signatures (CURRENT)
README.md                          # This file
PROJECT_DOCUMENTATION.md           # Complete documentation
README_page_extraction.md          # Page extraction guide
README_hybrid_extraction.md        # Signature extraction guide

Requirements

Data

  • Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
  • PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
  • Output: /Volumes/NV2/PDF-Processing/signature-image-output/

Status

Page extraction: Tested with 100 files, working Signature extraction: Tested with 5 files, 70% recall, 100% precision Large-scale testing: Pending Full dataset (86K files): Pending

See PROJECT_DOCUMENTATION.md for complete details.

S
Description
Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach. 70% recall, 100% precision.
Readme 3.7 MiB
Languages
Python 100%