
gbanyan and Claude Opus 4.7 12f716ddf1 Paper A v3.5: resolve codex round-4 residual issues

Fully addresses the partial-resolution / unfixed items from codex
gpt-5.4 round-4 review (codex_review_gpt54_v3_4.md):

Critical
- Table XI z/p columns now reproduce from displayed counts. Earlier
  table had 1-4-unit transcription errors in k values and a fabricated
  cos > 0.9407 calibration row; both fixed by rerunning Script 24
  with cos = 0.9407 added to COS_RULES and copying exact values from
  the JSON output.
- Section III-L classifier now defined entirely in terms of the
  independent-minimum dHash statistic that the deployed code (Scripts
  21, 23, 24) actually uses; the legacy "cosine-conditional dHash"
  language is removed. Tables IX, XI, XII, XVI are now arithmetically
  consistent with the III-L classifier definition.
- "0.95 not calibrated to Firm A" inconsistency reconciled: Section
  III-H now correctly says 0.95 is the whole-sample Firm A P95 of the
  per-signature cosine distribution, matching III-L and IV-F.

Major
- Abstract trimmed to 246 words (from 367) to meet IEEE Access 250-word
  limit. Removed "we break the circularity" overclaim; replaced with
  "report capture rates on both folds with Wilson 95% intervals to
  make fold-level variance visible".
- Conclusion mirrors the Abstract reframe: 70/30 split documents
  within-firm sampling variance, not external generalization.
- Introduction no longer promises precision / F1 / EER metrics that
  Methods/Results don't deliver; replaced with anchor-based capture /
  FAR + Wilson CI language.
- Section III-G within-auditor-year empirical-check wording corrected:
  intra-report consistency (IV-H.3) is a different test (two co-signers
  on the same report, firm-level homogeneity) and is not a within-CPA
  year-level mixing check; the assumption is maintained as a bounded
  identification convention.
- Section III-H "two analyses fully threshold-free" corrected to "only
  the partner-level ranking is threshold-free"; longitudinal-stability
  uses 0.95 cutoff, intra-report uses the operational classifier.
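
For reference, the Wilson 95% interval mentioned for the fold-level capture rates can be sketched as below. This is a minimal illustration of the standard Wilson score formula, not the code in the project's scripts; the function name `wilson_interval` and the example counts are hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959963984540054):
    """Return the (lower, upper) Wilson score interval for a binomial proportion.

    z defaults to the two-sided 95% normal quantile.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    low, high = wilson_interval(successes=70, n=100)
    print(f"capture rate 0.700, Wilson 95% CI [{low:.3f}, {high:.3f}]")
```

Unlike the plain normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for small folds or rates near 0 or 1, which is why it suits per-fold capture rates.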

Minor
- Impact Statement removed from export_v3.py SECTIONS list (IEEE Access
  Regular Papers do not have a standalone Impact Statement). The file
  itself is retained as an archived non-paper note for cover-letter /
  grant-report reuse, with a clear archive header.
- All 7 previously unused references ([27] dHash, [31][32] partner-
  signature mandates, [33] Taiwan partner rotation, [34] YOLO original,
  [35] VLM survey, [36] Mann-Whitney) are now cited in-text:
    [27] in Methodology III-E (dHash definition)
    [31][32][33] in Introduction (audit-quality regulation context)
    [34][35] in Methodology III-C/III-D
    [36] in Results IV-C (Mann-Whitney result)

Updated Script 24 to include cos = 0.9407 in COS_RULES so Table XI's
calibration-fold P5 row is computed from the same data file as the
other rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-21 12:23:03 +08:00

paper

Paper A v3.5: resolve codex round-4 residual issues

2026-04-21 12:23:03 +08:00

signature_analysis

Paper A v3.5: resolve codex round-4 residual issues

2026-04-21 12:23:03 +08:00

signature-comparison

Complete PP-OCRv5 research and v4 vs v5 comparison

2025-11-27 11:21:55 +08:00

test_results

Complete PP-OCRv5 research and v4 vs v5 comparison

2025-11-27 11:21:55 +08:00

.gitignore

Add Deloitte distribution & independent dHash analysis scripts

2026-04-20 21:34:24 +08:00

check_rejected_for_missing.py

Add PaddleOCR masking and region detection pipeline

2025-10-28 22:28:18 +08:00

COMMIT_SUMMARY.md

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

CURRENT_STATUS.md

Complete OpenCV Method 3 implementation with 86.5% handwriting retention

2025-11-27 10:35:46 +08:00

extract_handwriting.py

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

extract_pages_from_csv.py

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

extract_signatures_hybrid.py

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

extract_signatures_paddleocr_improved.py

Complete OpenCV Method 3 implementation with 86.5% handwriting retention

2025-11-27 10:35:46 +08:00

extract_signatures_vlm.py

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

extract_signatures_yolo.py

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

HOW_TO_CONTINUE.txt

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

NEW_SESSION_HANDOFF.md

Complete OpenCV Method 3 implementation with 86.5% handwriting retention

2025-11-27 10:35:46 +08:00

NEW_SESSION_PROMPT.txt

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

paddleocr_client.py

Add PaddleOCR masking and region detection pipeline

2025-10-28 22:28:18 +08:00

paddleocr_server_v5.py

Complete OpenCV Method 3 implementation with 86.5% handwriting retention

2025-11-27 10:35:46 +08:00

PADDLEOCR_STATUS.md

Add PaddleOCR masking and region detection pipeline

2025-10-28 22:28:18 +08:00

PP_OCRV5_RESEARCH_FINDINGS.md

Complete PP-OCRv5 research and v4 vs v5 comparison

2025-11-27 11:21:55 +08:00

PROJECT_DOCUMENTATION.md

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

README_hybrid_extraction.md

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

README_page_extraction.md

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

README.md

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

SAM3_RESEARCH_FINDINGS.md

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

SESSION_CHECKLIST.md

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

SESSION_INIT.md

Add hybrid signature extraction with name-based verification

2025-10-26 23:39:52 +08:00

test_mask_and_detect.py

Add PaddleOCR masking and region detection pipeline

2025-10-28 22:28:18 +08:00

test_opencv_advanced.py

Complete OpenCV Method 3 implementation with 86.5% handwriting retention

2025-11-27 10:35:46 +08:00

test_opencv_separation.py

Complete OpenCV Method 3 implementation with 86.5% handwriting retention

2025-11-27 10:35:46 +08:00

test_paddleocr_client.py

Add PaddleOCR masking and region detection pipeline

2025-10-28 22:28:18 +08:00

test_paddleocr.py

Add PaddleOCR masking and region detection pipeline

2025-10-28 22:28:18 +08:00

test_pp_ocrv5_api.py

Complete PP-OCRv5 research and v4 vs v5 comparison

2025-11-27 11:21:55 +08:00

test_v4_full_pipeline.py

Complete PP-OCRv5 research and v4 vs v5 comparison

2025-11-27 11:21:55 +08:00

test_v5_full_pipeline.py

Complete PP-OCRv5 research and v4 vs v5 comparison

2025-11-27 11:21:55 +08:00

visualize_v5_results.py

Complete PP-OCRv5 research and v4 vs v5 comparison

2025-11-27 11:21:55 +08:00

yolo_extract_from_index.py

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

yolo_full_scan.py

Add Paper A (IEEE TAI) complete draft with Firm A-calibrated dual-method classification

2026-04-06 23:05:33 +08:00

README.md

PDF Signature Extraction System

Automated extraction of handwritten Chinese signatures from PDF documents using hybrid VLM + Computer Vision approach.

Quick Start

Step 1: Extract Pages from CSV

cd /Volumes/NV2/pdf_recognize
source venv/bin/activate
python extract_pages_from_csv.py

Step 2: Extract Signatures

python extract_signatures_hybrid.py

Documentation

PROJECT_DOCUMENTATION.md - Complete project history, all approaches tested, detailed results
README_page_extraction.md - Page extraction documentation
README_hybrid_extraction.md - Hybrid signature extraction documentation

Current Performance

Test Dataset: 5 PDF pages

Signatures expected: 10
Signatures found: 7
Precision: 100% (no false positives)
Recall: 70%
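
The figures above follow directly from the counts: 7 signatures found, none of them false positives, out of 10 expected. A minimal sketch of the arithmetic (the helper name `precision_recall` is hypothetical and not part of the repository):

```python
def precision_recall(true_positives: int, false_positives: int, expected: int):
    """Compute precision and recall from raw detection counts."""
    found = true_positives + false_positives
    precision = true_positives / found if found else 0.0
    recall = true_positives / expected if expected else 0.0
    return precision, recall


# Counts from the 5-page test set: 7 found, 0 false positives, 10 expected.
precision, recall = precision_recall(true_positives=7, false_positives=0, expected=10)
print(f"Precision: {precision:.0%}, Recall: {recall:.0%}")
```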

Key Features

✅ Hybrid Approach: VLM name extraction + CV detection + VLM verification
✅ Name-Based: Signatures saved as signature_周寶蓮.png
✅ No False Positives: Name-specific verification filters out dates, text, stamps
✅ Duplicate Prevention: Only one signature per person
✅ Handles Both: PDFs with/without text layer
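
The name-based file naming and the one-signature-per-person rule could look roughly like the sketch below. It is an illustration only: the helper names are hypothetical, the second person's name is a placeholder, and keeping the first verified detection per name is an assumption, since this README does not say which candidate extract_signatures_hybrid.py retains.

```python
from pathlib import Path


def signature_path(output_dir: str, name: str) -> Path:
    """Build the output path in the signature_<name>.png pattern."""
    return Path(output_dir) / f"signature_{name}.png"


def first_per_person(detections):
    """Keep only the first verified detection for each name (assumed policy).

    `detections` is an iterable of (name, region_id) pairs in detection order.
    """
    seen = set()
    kept = []
    for name, region_id in detections:
        if name in seen:
            continue  # a signature for this person was already kept
        seen.add(name)
        kept.append((name, region_id))
    return kept


if __name__ == "__main__":
    # "王小明" is a placeholder name used only for this example.
    detections = [("周寶蓮", 0), ("王小明", 1), ("周寶蓮", 2)]
    for name, region_id in first_per_person(detections):
        print(region_id, signature_path("signature-image-output", name).name)
```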

File Structure

extract_pages_from_csv.py          # Step 1: Extract pages
extract_signatures_hybrid.py       # Step 2: Extract signatures (CURRENT)
README.md                          # This file
PROJECT_DOCUMENTATION.md           # Complete documentation
README_page_extraction.md          # Page extraction guide
README_hybrid_extraction.md        # Signature extraction guide

Requirements

Python 3.9+
PyMuPDF, OpenCV, NumPy, Requests
Ollama with qwen2.5vl:32b model
Ollama instance: http://192.168.30.36:11434
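
As a rough sketch of how a page image might be sent to the Ollama instance listed above, the snippet below builds a request for Ollama's `/api/generate` endpoint with a base64-encoded image. It uses only the standard library so that it is self-contained, whereas the project lists Requests as a dependency; the function names and the prompt are hypothetical and are not taken from the extraction scripts.

```python
import base64
import json
from urllib import request

OLLAMA_URL = "http://192.168.30.36:11434"  # instance listed under Requirements
MODEL = "qwen2.5vl:32b"


def build_generate_request(image_bytes: bytes, prompt: str,
                           base_url: str = OLLAMA_URL,
                           model: str = MODEL) -> request.Request:
    """Build a non-streaming /api/generate request carrying one image."""
    payload = {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }
    return request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_vlm(image_bytes: bytes, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the model's text reply."""
    req = build_generate_request(image_bytes, prompt)
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example (requires the Ollama instance to be reachable; prompt is hypothetical):
# with open("page.png", "rb") as f:
#     print(ask_vlm(f.read(), "List the names of the people who signed this page."))
```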

Data

Input: /Volumes/NV2/PDF-Processing/master_signatures.csv (86,073 rows)
PDFs: /Volumes/NV2/PDF-Processing/total-pdf/batch_*/
Output: /Volumes/NV2/PDF-Processing/signature-image-output/
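
Because the PDFs are spread across `batch_*` directories, locating one file means searching every batch. A minimal sketch, assuming the PDF's file name is already known (the helper `find_pdf` and the example file name are hypothetical; the column layout of master_signatures.csv is not described here, so CSV parsing is left out):

```python
from pathlib import Path
from typing import Optional

PDF_ROOT = Path("/Volumes/NV2/PDF-Processing/total-pdf")


def find_pdf(filename: str, pdf_root: Path = PDF_ROOT) -> Optional[Path]:
    """Return the first match for `filename` under any batch_* directory, or None."""
    # sorted() makes the choice deterministic if the name appears in several batches.
    for candidate in sorted(pdf_root.glob(f"batch_*/{filename}")):
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # "example.pdf" is a placeholder file name.
    print(find_pdf("example.pdf"))
```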

Status

✅ Page extraction: Tested with 100 files, working
✅ Signature extraction: Tested with 5 files, 70% recall, 100% precision
⏳ Large-scale testing: Pending
⏳ Full dataset (86K files): Pending

See PROJECT_DOCUMENTATION.md for complete details.