gbanyan e36c49d2d8 Add Phase 4 prose draft v1 (Abstract + I + II + V + VI)
Phase 4 first-pass draft replacing the v3.20.0 Abstract,
§I Introduction, §II Related Work, §V Discussion, and §VI
Conclusion blocks with the Big-4 reframed v4.0 prose. Single
consolidated file at paper/v4/paper_a_prose_v4_phase4.md.

Structure:
  Abstract  (~235 words, IEEE Access target <= 250)
  §I Introduction  (8-item contributions list updated for v4)
  §II Related Work  (mostly inherited; LOOO citation added)
  §V Discussion  (7 sub-sections: A-G covering distinct-problem
                  framing, accountant-level multimodality,
                  Firm A as templated-end case study, K=2
                  firm-mass conflation, K=3 reproducible shape,
                  three-score internal-consistency, pixel-
                  identity + inter-CPA validation, limitations)
  §VI Conclusion + Future Work  (4 future directions)

Key reframing decisions baked into the prose:
  - Abstract leads with Big-4 scope + dip-test multimodality +
    K=3 reproducibility + three-score convergence + 0% miss
    rate + full-dataset robustness.
  - §I positions the Big-4 sub-corpus scope as the
    methodologically privileged calibration unit ("smallest
    tested scope at which a finite-mixture model is
    statistically supportable").
  - §I-Contribution-4: Big-4 scope as substantive methodological
    finding (was v3.x "percentile-anchored operational
    threshold").
  - §I-Contribution-5: K=3 mixture as descriptive (was v3.x
    "distributional characterisation" framing).
  - §I-Contribution-6: three-score convergent internal-
    consistency (NEW in v4).
  - §I-Contribution-8: full-dataset robustness as light
    secondary scope (NEW in v4).
  - §V-D: explicit "K=2 is firm-mass driven; K=3 is
    reproducible in shape" framing — preempts the LOOO
    reviewer attack vector codex round 23 first flagged.
  - §V-G Limitations: seven explicit limitations including no
    signature-level hand-signed ground truth, pixel-identity
    conservative subset, MC band not separately v4-validated.
  - §VI Future Work: four directions including a Paper B
    placeholder for audit-quality companion analysis.

The technical §III v6 + §IV v3.2 are the foundation; this Phase
4 draft aligns the narrative with the codex-converged
methodology and results.

6 close-out items flagged at end of file (word-count check,
contribution count, LOOO citation, limitations grouping, Paper B
cross-ref, draft note stripping).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 22:46:19 +08:00

PDF Signature Extraction System

Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach.

Quick Start

Step 1: Extract Pages from CSV

cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py

Step 2: Extract Signatures

python extract_signatures_hybrid.py

Documentation

Current Performance

Test Dataset: 5 PDF pages

  • Signatures expected: 10
  • Signatures found: 7
  • Precision: 100% (no false positives)
  • Recall: 70%

Key Features

Hybrid Approach: VLM name extraction + CV detection + VLM verification Name-Based: Signatures saved as signature_周寶蓮.png No False Positives: Name-specific verification filters out dates, text, stamps Duplicate Prevention: Only one signature per person Handles Both: PDFs with/without text layer

File Structure

extract_pages_from_csv.py          # Step 1: Extract pages
extract_signatures_hybrid.py       # Step 2: Extract signatures (CURRENT)
README.md                          # This file
PROJECT_DOCUMENTATION.md           # Complete documentation
README_page_extraction.md          # Page extraction guide
README_hybrid_extraction.md        # Signature extraction guide

Requirements

Data

  • Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
  • PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
  • Output: /Volumes/NV2/PDF-Processing/signature-image-output/

Status

Page extraction: Tested with 100 files, working Signature extraction: Tested with 5 files, 70% recall, 100% precision Large-scale testing: Pending Full dataset (86K files): Pending

See PROJECT_DOCUMENTATION.md for complete details.

S
Description
Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach. 70% recall, 100% precision.
Readme 6.9 MiB
Languages
Python 100%