gbanyan 85cfefe49f Paper A v3.9: resolve codex round-8 regressions (Table XV baseline + cross-refs)
Codex round-8 (paper/codex_review_gpt54_v3_8.md) dissented from
Gemini's Accept and gave Minor Revision because of two real
numerical/consistency issues Gemini's round-7 review missed. This
commit fixes both.

Table XV per-year Firm A baseline-share column corrected
- All 11 yearly values resynced to the authoritative
  reports/partner_ranking/partner_ranking_report.md (per-year
  Deloitte baseline share column):
    2013: 26.2% -> 32.4%  (largest error; codex's test case)
    2014: 27.1% -> 27.8%
    2015: 27.2% -> 27.7%
    2016: 27.4% -> 26.2%
    2017: 27.9% -> 27.2%
    2018: 28.1% -> 26.5%
    2019: 28.2% -> 27.0%
    2020: 28.3% -> 27.7%
    2021: 28.4% -> 28.7%
    2022: 28.5% -> 28.3%
    2023: 28.5% -> 27.4%
- Codex independently verified that the prior 2013 value 26.2% was
  numerically impossible because the underlying JSON places 97 Firm
  A auditor-years in the 2013 top-50% bucket out of 324 total, so
  the full-year baseline must be at least 97/324 = 29.9%.
- All other Table XV columns (N, Top-10% k, in top-10%, share) were
  already correct and unchanged.

Broken cross-references from earlier renumbering repaired
- Methodology III-E: "ablation study (Section IV-F)" pointer
  corrected to "Section IV-J"; the ablation is at Section IV-J
  line 412 in the current Results, while IV-F is now "Calibration
  Validation with Firm A".
- Results Table XVIII note: "per-signature best-match values in
  Tables IV/VI (mean = 0.980)" is orphaned after earlier
  renumbering (Table IV is all-pairs distributional statistics;
  Table VI is accountant-level GMM model selection). Replaced with
  an explicit pointer to "Section IV-D and visualized in Table XIII
  (whole-sample Firm A best-match mean ~ 0.980)". Table XIII is
  the correct container of per-signature best-match mean statistics.

All other Section IV-X cross-references in methodology / results /
discussion were spot-checked and remain correct under the current
section numbering.

With these two surgical fixes, codex's round-8 ranked items (1) and
(2) are cleared. Item (3) was the final DOCX packaging pass (author
metadata fill-in, figure rendering, reference formatting) which is
done manually at submission time and does not affect the markdown.

Deferred items remain deferred:
- Visual-inspection protocol details (codex round-5 item 4)
- General reproducibility appendix (codex round-5 item 6)
Both are defensible for first IEEE Access submission per codex
round-8 assessment, since the manuscript no longer leans on visual
inspection or BD/McCrary as decisive standalone evidence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 14:59:27 +08:00

PDF Signature Extraction System

Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach.

Quick Start

Step 1: Extract Pages from CSV

cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py

Step 2: Extract Signatures

python extract_signatures_hybrid.py

Documentation

Current Performance

Test Dataset: 5 PDF pages

  • Signatures expected: 10
  • Signatures found: 7
  • Precision: 100% (no false positives)
  • Recall: 70%

Key Features

Hybrid Approach: VLM name extraction + CV detection + VLM verification Name-Based: Signatures saved as signature_周寶蓮.png No False Positives: Name-specific verification filters out dates, text, stamps Duplicate Prevention: Only one signature per person Handles Both: PDFs with/without text layer

File Structure

extract_pages_from_csv.py          # Step 1: Extract pages
extract_signatures_hybrid.py       # Step 2: Extract signatures (CURRENT)
README.md                          # This file
PROJECT_DOCUMENTATION.md           # Complete documentation
README_page_extraction.md          # Page extraction guide
README_hybrid_extraction.md        # Signature extraction guide

Requirements

Data

  • Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
  • PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
  • Output: /Volumes/NV2/PDF-Processing/signature-image-output/

Status

Page extraction: Tested with 100 files, working Signature extraction: Tested with 5 files, 70% recall, 100% precision Large-scale testing: Pending Full dataset (86K files): Pending

See PROJECT_DOCUMENTATION.md for complete details.

S
Description
Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach. 70% recall, 100% precision.
Readme 3.4 MiB
Languages
Python 100%