gbanyan 51d15b32a5 Paper A v3.2: partner v4 feedback integration (threshold-independent benchmark validation)
Partner v4 (signature_paper_draft_v4) proposed 3 substantive improvements;
partner confirmed the 2013-2019 restriction was an error (sample stays
2013-2023). The remaining suggestions are adopted with our own data.

## New scripts
- Script 22 (partner ranking): ranks all Big-4 auditor-years by mean
  max-cosine. Firm A occupies 95.9% of top-10% (base 27.8%), 3.5x
  concentration ratio. Stable across 2013-2023 (88-100% per year).
- Script 23 (intra-report consistency): for each 2-signer report,
  classify both signatures and check agreement. Firm A agrees 89.9%
  vs 62-67% at other Big-4. 87.5% Firm A reports have BOTH signers
  non-hand-signed; only 4 reports (0.01%) both hand-signed.

## New methodology additions
- III-G: explicit within-auditor-year no-mixing identification
  assumption (supported by Firm A interview evidence).
- III-H: 4th Firm A validation line: threshold-independent evidence
  from partner ranking + intra-report consistency.

## New results section IV-H (threshold-independent validation)
- IV-H.1: Firm A year-by-year cosine<0.95 rate. 2013-2019 mean=8.26%,
  2020-2023 mean=6.96%, 2023 lowest (3.75%). Stability contradicts
  partner's hypothesis that 2020+ electronic systems increase
  heterogeneity -- data shows opposite (electronic systems more
  consistent than physical stamping).
- IV-H.2: partner ranking top-K tables (pooled + year-by-year).
- IV-H.3: intra-report consistency per-firm table.

## Renumbering
- Section H (was Classification Results) -> I
- Section I (was Ablation) -> J
- Tables XIII-XVI new (yearly stability, top-K pooled, top-10% per-year,
  intra-report), XVII = classification (was XII), XVIII = ablation
  (was XIII).

These threshold-independent analyses address the codex review concern
about circular validation by providing benchmark evidence that does not
depend on any threshold calibrated to Firm A itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 01:59:49 +08:00

PDF Signature Extraction System

Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach.

Quick Start

Step 1: Extract Pages from CSV

cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py

Step 2: Extract Signatures

python extract_signatures_hybrid.py

Documentation

Current Performance

Test Dataset: 5 PDF pages

  • Signatures expected: 10
  • Signatures found: 7
  • Precision: 100% (no false positives)
  • Recall: 70%

Key Features

Hybrid Approach: VLM name extraction + CV detection + VLM verification Name-Based: Signatures saved as signature_周寶蓮.png No False Positives: Name-specific verification filters out dates, text, stamps Duplicate Prevention: Only one signature per person Handles Both: PDFs with/without text layer

File Structure

extract_pages_from_csv.py          # Step 1: Extract pages
extract_signatures_hybrid.py       # Step 2: Extract signatures (CURRENT)
README.md                          # This file
PROJECT_DOCUMENTATION.md           # Complete documentation
README_page_extraction.md          # Page extraction guide
README_hybrid_extraction.md        # Signature extraction guide

Requirements

Data

  • Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
  • PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
  • Output: /Volumes/NV2/PDF-Processing/signature-image-output/

Status

Page extraction: Tested with 100 files, working Signature extraction: Tested with 5 files, 70% recall, 100% precision Large-scale testing: Pending Full dataset (86K files): Pending

See PROJECT_DOCUMENTATION.md for complete details.

S
Description
Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach. 70% recall, 100% precision.
Readme 3.4 MiB
Languages
Python 100%