gbanyan b33e20d479 Rewrite Phase 4 prose v3: Abstract / §I / §V / §VI to match §III v7
Major Phase 4 prose update aligning narrative with the §III v7
anchor-based ICCR framework (codex rounds 29-34):

- Abstract (247 words, under 250 limit): replaced K=3 mixture +
  natural-threshold framing with composition decomposition +
  multi-level ICCR + firm heterogeneity. Positioning as
  specificity-proxy-anchored screening framework.

- §I Introduction:
  * Methodological-design paragraph rewritten (no natural threshold;
    multi-level reporting; per-firm stratification; unsupervised
    disclosure)
  * Two new paragraphs documenting composition decomposition
    overturning distributional path, and anchor-based three-unit
    ICCR calibration
  * Firm heterogeneity + within-firm collision concentration as
    central findings
  * Contribution list rewritten (8 items): composition decomposition
    disproves natural threshold (NEW #4); multi-level ICCR
    calibration (NEW #5); firm heterogeneity quantification (NEW #6);
    K=3 demoted to descriptive partition (#7); multi-tool validation
    ceiling positioning (#8)

- §V Discussion:
  * §V-B retitled "composition-driven multimodality"; 2x2 factorial
    decomposition reported
  * §V-C Firm A reframed: position contrast + within-firm collision
    pattern, not "templated-end calibration anchor"
  * §V-D K=2/K=3 reframed as descriptive firm-compositional
    partitions (no "mechanism boundary" language)
  * §V-E three-score convergence reinterpreted as descriptor-position
    ranking, not hand-leaning mechanism ranking
  * §V-F (new title) Anchor-based multi-level calibration with all
    three units of analysis
  * §V-G expanded to 9 v4-specific limitations (no signature-level
    ground truth; assumption-violation; scope; conservative-subset;
    inherited rule components; deployed-rate excess not TPR; A1
    stipulation; K=3 composition sensitivity; no partner-level
    mechanism attribution) plus 5 inherited limitations

- §VI Conclusion: 8-point contribution list mirroring §I; 4 future
  work directions including within-firm collision-mechanism
  disambiguation and audit-quality companion analysis.

- Header draft-note updated to v3 (post codex rounds 26-34);
  Phase 4 v2 changelog moved to CHANGELOG.md placeholder.

Companion to §III v7 commit 723a3f6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 18:10:04 +08:00

PDF Signature Extraction System

Automated extraction of handwritten Chinese signatures from PDF documents using a hybrid VLM + computer vision approach.

Quick Start

Step 1: Extract Pages from CSV

cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py

Step 2: Extract Signatures

python extract_signatures_hybrid.py

Documentation

Current Performance

Test Dataset: 5 PDF pages

  • Signatures expected: 10
  • Signatures found: 7
  • Precision: 100% (no false positives)
  • Recall: 70%

Key Features

  • Hybrid Approach: VLM name extraction + CV detection + VLM verification
  • Name-Based: Signatures saved as signature_周寶蓮.png
  • No False Positives: Name-specific verification filters out dates, text, stamps
  • Duplicate Prevention: Only one signature per person
  • Handles Both: PDFs with/without text layer
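The name-based naming and duplicate-prevention features can be sketched as below. This is a hypothetical illustration, not the code in extract_signatures_hybrid.py: the helper names, the `(name, candidate_id)` input shape, and the "first verified candidate wins" policy are all assumptions.

```python
from pathlib import Path


def signature_filename(name):
    """Build the name-based output filename, e.g. signature_周寶蓮.png."""
    return f"signature_{name.strip()}.png"


def plan_outputs(verified_candidates, output_dir):
    """Map each person to one candidate and one output path.

    verified_candidates: iterable of (name, candidate_id) pairs that are
    assumed to have already passed name-specific verification.
    Only the first candidate per name is kept (assumed policy).
    """
    planned = {}
    for name, candidate_id in verified_candidates:
        if name in planned:
            continue  # duplicate prevention: one signature per person
        planned[name] = (candidate_id, Path(output_dir) / signature_filename(name))
    return planned


# Two verified regions for the same person: only the first is kept.
plan = plan_outputs([("周寶蓮", 0), ("周寶蓮", 3)], "signature-image-output")
```

Keying the plan on the extracted name is what makes the output name-based; a real implementation might instead keep the highest-confidence candidate per name.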

File Structure

extract_pages_from_csv.py          # Step 1: Extract pages
extract_signatures_hybrid.py       # Step 2: Extract signatures (CURRENT)
README.md                          # This file
PROJECT_DOCUMENTATION.md           # Complete documentation
README_page_extraction.md          # Page extraction guide
README_hybrid_extraction.md        # Signature extraction guide

Requirements

Data

  • Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
  • PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
  • Output: /Volumes/NV2/PDF-Processing/signature-image-output/
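Before a large run, it can help to check that this layout is in place. A minimal sketch, assuming the CSV is UTF-8 with a single header row and that PDFs sit directly inside each batch_* directory; none of these assumptions is confirmed by the project, and the function names are hypothetical:

```python
import csv
from pathlib import Path


def count_csv_rows(csv_path):
    """Count data rows in the CSV, skipping an assumed header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the assumed header row
        return sum(1 for _ in reader)


def list_batch_pdfs(pdf_root):
    """Return sorted PDF paths found directly under batch_* subdirectories."""
    return sorted(Path(pdf_root).glob("batch_*/*.pdf"))


def summarize_layout(csv_path, pdf_root, output_dir):
    """Collect basic facts about the input and output locations."""
    return {
        "csv_rows": count_csv_rows(csv_path),
        "pdf_count": len(list_batch_pdfs(pdf_root)),
        "output_dir_exists": Path(output_dir).is_dir(),
    }
```

Calling `summarize_layout` with the three paths listed above would show whether the CSV row count and the number of PDFs on disk are in the same range before any extraction starts.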

Status

  • Page extraction: Tested with 100 files, working
  • Signature extraction: Tested with 5 files, 70% recall, 100% precision
  • Large-scale testing: Pending
  • Full dataset (86K files): Pending

See PROJECT_DOCUMENTATION.md for complete details.
