
gbanyan and Claude Opus 4.7 ce33156238 Apply codex round-23 corrections: §IV v3 + §III v4

Codex round 23 returned Major Revision on §IV v2: 6 Major + 6
Minor + 5 Editorial findings. Codex confirmed the spike-script
provenance is mostly sound -- no scripts needed rerunning -- so
v3 applies presentation-level fixes only.

Decisions baked in:
  - Anonymisation: maintain Firm A-D pseudonyms throughout the
    manuscript body; remove (Deloitte) / (KPMG) / (PwC) / (EY)
    parentheticals from all v4 §IV tables.
  - Table numbering: v4 tables use fresh V-XVIII (plus Table XV-B);
    inherited v3.x tables are cited only as "v3.20.0 Table N" with
    the original v3 number, NOT renumbered into the v4 sequence.

§IV v3 changes:
  1. Detection denominator rewritten: 86,072 VLM-positive / 12
     corrupted / 86,071 YOLO-processed / 85,042 with-detections /
     182,328 signatures (matches v3.x §IV-B exact wording).
  2. All v4 table labels stripped of "(revised:" / "(NEW:"
     prefixes; replaced with clean "Table N. <descriptor>." form.
  3. Real firm names removed from all tables: 4 replace_all edits.
  4. Line 211 MC-ordering claim removed: MC occupancy is no longer
     described as "consistent with the §III-K Spearman convergence"
     because MC fraction is not monotone in per-CPA hand-leaning
     ranking. New language: descriptive only, with Firm D / Firm B
     ordering counterexample stated.
  5. Line 184 81.70% vs 82.46% qualified as "qualitative
     alignment, not like-for-like consistency check" (different
     units: per-signature class vs per-CPA hard cluster).
  6. Line 43 BD-transition "histogram-resolution artefacts"
     softened to "scope-dependent and not used operationally";
     no specific bin-width artefact claim without sensitivity
     sweep evidence.
  7. K=3 LOOO C1 weight drift corrected: 0.025 -> 0.023 (matches
     Script 37 max deviation 0.0235 / rounded 0.023).
  8. Seed coverage in §IV-A updated: "Scripts 32-42" (was
     "Scripts 32-41", missed Script 42).
  9. Low-cosine cutoff inclusivity: cos < 0.837 -> cos <= 0.837
     (matches Script 42 rule definition).
  10. "round-22 Light scope" process note removed from
      manuscript prose in §IV-K.
  11. §IV-L ablation pointer corrected: v3.20.0 §IV-I (was
      §IV-H.3); v3.20.0 Table XVIII clarified as different from
      v4 Table XVIII.
  12. Line 75 "Component recovery verified across Scripts 35,
      37, 38" rewritten: "the full-fit baseline is reproduced
      in Scripts 35, 37, 38" with explicit note that Script 37
      LOOO fold-specific components differ by design.
  13. Line 110 grammar: "This convergent-checks evidence" ->
      "These convergence checks".
  14. Draft note marked "internal -- remove before submission".

§III v4 changes (cross-reference cleanup):
  1. Line 13 cross-reference repaired: "§IV-D, §IV-F, §IV-G"
     (which are now accountant-level v4 analyses) replaced with
     accurate signature-level references (§IV-J for five-way
     counts; §IV-I for inherited inter-CPA FAR).
  2. Line 23 cross-reference repaired: "all §IV results except
     §IV-K" replaced with explicit list of v4-new vs inherited
     sub-sections.
  3. Line 109 cross-reference repaired: moderate-band capture-
     rate evidence cited as "v3.20.0 Tables IX, XI, XII, XII-B"
     (was "§IV-F", which is now Convergent Internal-Consistency
     Checks, not capture-rate).
  4. Line 131 "without recalibration" claim narrowed: §III-K's
     convergent-checks evidence is now scoped to the binary
     high-confidence rule only; the moderate-confidence band,
     style-consistency band, and document-level aggregation
     are retained by reference to v3.20.0 calibration, not
     claimed as v4.0-validated.

Outstanding open questions: 3 procedural items remain (§IV
table numbering finalisation, §IV-A-C content audit, Phase 4
prose); no methodology blockers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-12 17:03:33 +08:00

.planning

Update STATE.md: Phase 1 complete, Phase 2 awaiting user review

2026-05-12 15:24:03 +08:00

paper

Apply codex round-23 corrections: §IV v3 + §III v4

2026-05-12 17:03:33 +08:00

signature_analysis

Phase 3 close-out: Script 42 + §IV draft v2 (Table XV filled)

2026-05-12 16:45:22 +08:00

signature-comparison

Complete PP-OCRv5 research and v4 vs v5 comparison

2025-11-27 11:21:55 +08:00

test_results

Complete PP-OCRv5 research and v4 vs v5 comparison

2025-11-27 11:21:55 +08:00

.gitignore

Add Deloitte distribution & independent dHash analysis scripts

2026-04-20 21:34:24 +08:00

check_rejected_for_missing.py

Add PaddleOCR masking and region detection pipeline

2025-10-28 22:28:18 +08:00

COMMIT_SUMMARY.md

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

CURRENT_STATUS.md

Complete OpenCV Method 3 implementation with 86.5% handwriting retention

2025-11-27 10:35:46 +08:00

extract_handwriting.py

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

extract_pages_from_csv.py

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

extract_signatures_hybrid.py

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

extract_signatures_paddleocr_improved.py

Complete OpenCV Method 3 implementation with 86.5% handwriting retention

2025-11-27 10:35:46 +08:00

extract_signatures_vlm.py

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

extract_signatures_yolo.py

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

HOW_TO_CONTINUE.txt

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

NEW_SESSION_HANDOFF.md

Complete OpenCV Method 3 implementation with 86.5% handwriting retention

2025-11-27 10:35:46 +08:00

NEW_SESSION_PROMPT.txt

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

paddleocr_client.py

Add PaddleOCR masking and region detection pipeline

2025-10-28 22:28:18 +08:00

paddleocr_server_v5.py

Complete OpenCV Method 3 implementation with 86.5% handwriting retention

2025-11-27 10:35:46 +08:00

PADDLEOCR_STATUS.md

Add PaddleOCR masking and region detection pipeline

2025-10-28 22:28:18 +08:00

PP_OCRV5_RESEARCH_FINDINGS.md

Complete PP-OCRv5 research and v4 vs v5 comparison

2025-11-27 11:21:55 +08:00

PROJECT_DOCUMENTATION.md

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

README_hybrid_extraction.md

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

README_page_extraction.md

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

README.md

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

SAM3_RESEARCH_FINDINGS.md

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

SESSION_CHECKLIST.md

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

SESSION_INIT.md

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

test_mask_and_detect.py

Add PaddleOCR masking and region detection pipeline

2025-10-28 22:28:18 +08:00

test_opencv_advanced.py

Complete OpenCV Method 3 implementation with 86.5% handwriting retention

2025-11-27 10:35:46 +08:00

test_opencv_separation.py

Complete OpenCV Method 3 implementation with 86.5% handwriting retention

2025-11-27 10:35:46 +08:00

test_paddleocr_client.py

Add PaddleOCR masking and region detection pipeline

2025-10-28 22:28:18 +08:00

test_paddleocr.py

Add PaddleOCR masking and region detection pipeline

2025-10-28 22:28:18 +08:00

test_pp_ocrv5_api.py

Complete PP-OCRv5 research and v4 vs v5 comparison

2025-11-27 11:21:55 +08:00

test_v4_full_pipeline.py

Complete PP-OCRv5 research and v4 vs v5 comparison

2025-11-27 11:21:55 +08:00

test_v5_full_pipeline.py

Complete PP-OCRv5 research and v4 vs v5 comparison

2025-11-27 11:21:55 +08:00

visualize_v5_results.py

Complete PP-OCRv5 research and v4 vs v5 comparison

2025-11-27 11:21:55 +08:00

yolo_extract_from_index.py

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

yolo_full_scan.py

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

README.md

PDF Signature Extraction System

Automated extraction of handwritten Chinese signatures from PDF documents using a hybrid VLM + computer vision approach.

Quick Start

Step 1: Extract Pages from CSV

cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py

Step 2: Extract Signatures

python extract_signatures_hybrid.py

Documentation

PROJECT_DOCUMENTATION.md - Complete project history, all approaches tested, detailed results
README_page_extraction.md - Page extraction documentation
README_hybrid_extraction.md - Hybrid signature extraction documentation

Current Performance

Test Dataset: 5 PDF pages

Signatures expected: 10
Signatures found: 7
Precision: 100% (no false positives)
Recall: 70%
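
The figures above follow directly from the counts. A minimal sketch of the arithmetic, assuming every found signature was checked against the expected list (the helper name `precision_recall` is illustrative and not part of the repository):

```python
def precision_recall(expected: int, found: int, false_positives: int) -> tuple[float, float]:
    """Return (precision, recall) from raw signature counts."""
    true_positives = found - false_positives
    # Precision: share of found signatures that are real signatures.
    precision = true_positives / found if found else 0.0
    # Recall: share of expected signatures that were actually found.
    recall = true_positives / expected if expected else 0.0
    return precision, recall


# Counts from the 5-page test set: 10 expected, 7 found, no false positives.
precision, recall = precision_recall(expected=10, found=7, false_positives=0)
print(f"Precision: {precision:.0%}, Recall: {recall:.0%}")
```

With these counts the three missed signatures lower recall only; precision stays at 100% because nothing spurious was saved.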

Key Features

✅ Hybrid Approach: VLM name extraction + CV detection + VLM verification
✅ Name-Based: Signatures saved as signature_周寶蓮.png
✅ No False Positives: Name-specific verification filters out dates, text, stamps
✅ Duplicate Prevention: Only one signature per person
✅ Handles Both: PDFs with/without text layer
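
The name-based output and duplicate-prevention features can be illustrated as below. This is a hypothetical sketch, not the logic in extract_signatures_hybrid.py: the function names, the `(name, region_id)` input shape, and the "first verified region wins" rule are all assumptions made for illustration.

```python
def signature_filename(name: str) -> str:
    """Build the output filename for a signer, e.g. signature_周寶蓮.png."""
    return f"signature_{name}.png"


def keep_first_per_name(verified_regions):
    """Keep one region per signer name.

    verified_regions: iterable of (name, region_id) pairs that already
    passed name-specific verification. The first region seen for each
    name is kept (an assumed tie-break rule); later ones are skipped.
    """
    kept = {}
    for name, region_id in verified_regions:
        if name not in kept:
            kept[name] = region_id
    return {signature_filename(name): region_id for name, region_id in kept.items()}


# Hypothetical verified regions; the second 周寶蓮 match is a duplicate.
regions = [("周寶蓮", "region_1"), ("周寶蓮", "region_4"), ("signer_b", "region_2")]
outputs = keep_first_per_name(regions)
print(outputs)
```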

File Structure

extract_pages_from_csv.py          # Step 1: Extract pages
extract_signatures_hybrid.py       # Step 2: Extract signatures (CURRENT)
README.md                          # This file
PROJECT_DOCUMENTATION.md           # Complete documentation
README_page_extraction.md          # Page extraction guide
README_hybrid_extraction.md        # Signature extraction guide

Requirements

Python 3.9+
PyMuPDF, OpenCV, NumPy, Requests
Ollama with qwen2.5vl:32b model
Ollama instance: http://192.168.30.36:11434
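
Before running the extraction scripts it can help to confirm that the Ollama instance is reachable and has the model pulled. A minimal sketch, assuming the standard Ollama `GET /api/tags` endpoint, which lists local models under a `models` key; the helper names are illustrative and not part of the repository:

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://192.168.30.36:11434"  # instance listed in Requirements
MODEL = "qwen2.5vl:32b"                    # model listed in Requirements


def model_available(tags_payload: dict, model: str) -> bool:
    """Check whether `model` appears in a parsed /api/tags response."""
    return any(entry.get("name") == model for entry in tags_payload.get("models", []))


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Fetch the list of locally available models from the Ollama server."""
    with urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
        return json.load(response)


# Usage (requires network access to the Ollama instance):
#   if not model_available(fetch_tags(), MODEL):
#       raise SystemExit(f"{MODEL} is not available on {OLLAMA_URL}")
```

Keeping the payload check separate from the network call makes it easy to test without a running server.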

Data

Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
Output: /Volumes/NV2/PDF-Processing/signature-image-output/

Status

✅ Page extraction: Tested with 100 files, working
✅ Signature extraction: Tested with 5 files, 70% recall, 100% precision
⏳ Large-scale testing: Pending
⏳ Full dataset (86K files): Pending

See PROJECT_DOCUMENTATION.md for complete details.