gbanyan 10c82fd446 Apply codex round-26 corrections to Phase 4 prose v2
Codex round 26 returned Major Revision on Phase 4 v1: 9 Major
findings + 12 Minor + reviewer-attack vulnerabilities. v2
applies all flagged corrections.

Abstract changes:
  - "Three independent feature-derived scores" -> "Three
    feature-derived scores ... not statistically independent
    because all three are functions of the same descriptor
    pair". Names the operational output as the inherited
    five-way classifier.
  - Trimmed from 277 to ~245 words to stay within IEEE Access
    250-word limit while keeping all numerical anchors.

§I Introduction:
  - Line 29 cross-ref §III-D -> §III-G through §III-J
    (§III-D was wrong; the methodology lives in §III-G/I/J).
  - Big-4 scope claim narrowed: "neither any single firm pooled
    alone nor the broader full-dataset variant rejects" -> "none
    of the narrower comparison scopes tested in Script 32
    rejects" with explicit enumeration (Firm A pooled alone;
    Firms B+C+D pooled; all non-Firm-A pooled).
  - "Three independent feature-derived scores" -> "Three
    feature-derived scores ... not statistically independent".
  - Contribution 4 "not at narrower scopes" -> "not in the
    narrower comparison scopes tested".
  - Contribution 8 "demonstrating pipeline reproducibility at
    multiple scopes" -> narrowed to "K=3 + box-rule
    rank-convergence reproduces at full n=686; does not
    re-validate operational thresholds / LOOO / five-way / pixel
    identity at the broader scope".
  - "external validation" softened to "annotation-free
    validation" in methodological-safeguards paragraph.
  - "(5)–(8)" pipeline stage list updated with corrected
    section references.
  - "Published box rule" -> "inherited Paper A box rule".
  - Added Big-4 pixel-identity per-firm breakdown (145/8/107/2)
    in §I body for completeness.

§II Related Work:
  - Replaced placeholder with explicit defer-to-master statement:
    v3.20.0 §II is inherited substantively unchanged in the master
    manuscript; only the LOOO addition is reproduced here.
  - "[add citation]" replaced with placeholder references
    [42] Stone 1974, [43] Geisser 1975, [44] Vehtari et al. 2017
    explicitly marked as draft references to be finalised at
    copy-edit time.
  - LOOO addition reframed: composition-sensitivity band on the
    mixture characterisation, not on the operational classifier.

§V Discussion:
  - §V-B "v4.0 inherits and confirms" softened to "v4.0 inherits
    this signature-level reading and remains consistent with it
    (no signature-level diagnostic was newly run in v4)".
  - §V-B "some CPAs are templated, some are hand-leaning, some
    are mixed" rewritten as component-membership wording: "some
    CPAs' observed signatures place their per-CPA means in the
    templated/mixed/hand-leaning region of the descriptor plane".
  - §V-B within-CPA unimodality explanation softened from
    "produces" to "can be jointly consistent" with explicit
    §III-G cross-ref.
  - §V-C Firm A byte-level provenance: 145 pixel-identical
    signatures verified in Script 40; 50 partners / 35 cross-year
    explicitly inherited from v3 / Script 28 not regenerated in
    v4 spikes.
  - §V-C "anchors §IV-H's positive-anchor miss-rate" -> "is the
    largest of the four Big-4 subsets, with full anchor pooling
    Firm A 145, Firm B 8, Firm C 107, Firm D 2".
  - §V-E "published box rule" -> "inherited Paper A box rule";
    "produce the same per-CPA ranking" -> "broadly concordant
    rankings, with residual non-Firm-A disagreement".
  - §V-G limitations expanded from 7 to 12 items: restored the
    5 v3.20.0 inherited limitations (transferred ImageNet
    features, HSV stamp-removal artifacts, longitudinal scan
    confounds, source-exemplar misattribution, legal
    interpretation).
  - §V-G scope limitation: removed unsupported "narrower or
    broader scopes" full-dataset dip-test claim.

§VI Conclusion:
  - Names operational output: "inherited Paper A five-way
    per-signature classifier with worst-case document-level
    aggregation".
  - "Cross-scope pipeline reproducibility" -> "K=3 + box-rule
    rank-convergence reproduces at full n=686; does not
    re-validate operational thresholds, LOOO, five-way classifier,
    or pixel-identity at the broader scope".
  - Future-work direction 3 explicitly qualifies the within-Big-4
    contrast as "accountant-level descriptive features of the K=3
    mixture, not validated mechanism-level claims and not
    currently linked to audit-quality outcomes".

Round 26 closure post-v2:
  - All 9 Major findings: CLOSED in v2 prose body.
  - All 12 Minor findings: CLOSED in v2 prose body.
  - Phase 5 readiness: should now move from Partial to Yes
    pending codex round 27 verification.

Provenance: codex round-26 confirmed 17/17 numerical claims in
Phase 4 v1 (only finding #5, the scope-test wording, was an
overclaim rather than a numerical error). v2 keeps all confirmed
numerics and narrows only the scope-test wording.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 23:50:09 +08:00

PDF Signature Extraction System

Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach.

Quick Start

Step 1: Extract Pages from CSV

cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py

Step 2: Extract Signatures

python extract_signatures_hybrid.py

Documentation

Current Performance

Test Dataset: 5 PDF pages

  • Signatures expected: 10
  • Signatures found: 7
  • Precision: 100% (no false positives)
  • Recall: 70%

Key Features

Hybrid Approach: VLM name extraction + CV detection + VLM verification Name-Based: Signatures saved as signature_周寶蓮.png No False Positives: Name-specific verification filters out dates, text, stamps Duplicate Prevention: Only one signature per person Handles Both: PDFs with/without text layer

File Structure

extract_pages_from_csv.py          # Step 1: Extract pages
extract_signatures_hybrid.py       # Step 2: Extract signatures (CURRENT)
README.md                          # This file
PROJECT_DOCUMENTATION.md           # Complete documentation
README_page_extraction.md          # Page extraction guide
README_hybrid_extraction.md        # Signature extraction guide

Requirements

Data

  • Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
  • PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
  • Output: /Volumes/NV2/PDF-Processing/signature-image-output/

Status

Page extraction: Tested with 100 files, working Signature extraction: Tested with 5 files, 70% recall, 100% precision Large-scale testing: Pending Full dataset (86K files): Pending

See PROJECT_DOCUMENTATION.md for complete details.

S
Description
Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach. 70% recall, 100% precision.
Readme 6.9 MiB
Languages
Python 100%