pdf_signature_extraction/.planning/PROJECT.md

# Taiwan TWSE CPA Signature Authentication

## What This Is

A computer-vision research pipeline that classifies CPA signatures on the financial reports of Taiwan TWSE-listed companies as hand-signed (親簽) or non-hand-signed (非親簽: early-period rubber stamp or scan, or post-2020 firm-level electronic signature systems). The pipeline ingests ~90k PDFs (2013-2023), detects ~182k signatures with YOLOv11n, embeds them with ResNet-50 (ImageNet1K_V2 weights, no fine-tuning), and characterises their distributional structure with two descriptors: cosine similarity and an independent dHash. Target: a peer-reviewed publication in IEEE Access (A/6 on the NCKU CSIE journal list).
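
To make the two descriptors concrete, here is a minimal pure-Python sketch of cosine similarity between embedding vectors and a difference hash (dHash) compared by Hamming distance. It is an illustration only, not the implementation in Script 14: the function names are hypothetical, the input is assumed to be a grayscale grid already resized to 8 rows by 9 columns (the resize step is omitted), and the "left brighter than right" bit convention is one of several in use.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def dhash_bits(gray):
    """Difference hash of a grayscale grid (list of rows).

    Each bit is 1 when a pixel is brighter than its right-hand neighbour,
    so an 8 x 9 grid yields 64 bits. Assumes the grid is already resized.
    """
    return [
        1 if row[i] > row[i + 1] else 0
        for row in gray
        for i in range(len(row) - 1)
    ]


def hamming(bits_a, bits_b):
    """Number of positions at which two equal-length bit lists differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Toy data: a left-to-right brightness ramp, and a copy with one pixel altered.
grid_a = [list(range(9)) for _ in range(8)]
grid_b = [row[:] for row in grid_a]
grid_b[0][0] = 5  # flips exactly one horizontal comparison

hash_a = dhash_bits(grid_a)
hash_b = dhash_bits(grid_b)
print(len(hash_a), hamming(hash_a, hash_b))
```

A high cosine similarity together with a small Hamming distance is the kind of joint signal the dual-descriptor design relies on; the actual thresholds come from the fitted distributions, not from this sketch.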

## Core Value

A statistically defensible, **reproducible** thresholding methodology that distinguishes hand-signed from digitally-replicated CPA signatures at the population level, with traceable evidence at every step (DB → script → table → paper claim).

## Requirements

### Validated

<!-- Shipped and confirmed valuable. -->

- ✓ End-to-end pipeline (TWSE MOPS scrape → Qwen2.5-VL prefilter → YOLO detection → ResNet embedding → DB + descriptors) — `signature_analysis/01-19`
- ✓ Independent dHash descriptor for replication detection — Script 14 (v3.x baseline)
- ✓ Accountant-level 3-component GMM characterisation — Script 18/20 (v3.x baseline)
- ✓ Paper A v3.20.0 manuscript (full-dataset framing, partner Jimmy 2026-04-27 substantive review accepted, codex 3-pass verification clean) — commit `53125d1` on `yolo-signature-pipeline`
- ✓ Spike scripts 32-35 confirming Big-4-only scope is methodologically superior — commits `e1d81e3`, `8ac0988`, `55f9f94` on `paper-a-v4-big4`

### Active

<!-- Current scope. Building toward these. -->

**Milestone: Paper A v4.0 — Big-4 reframe (primary scope) + full-dataset robustness (secondary)**

- [ ] Foundation: rerun core scripts on Big-4 subset with `--scope=big4` flag (`/scripts 19, 20, 21, 24, 25`)
- [ ] Methodology rewrite: §III-G/I/J/L re-anchored on dip-test-confirmed bimodality and the bootstrap-stable Big-4 K=2 GMM (cos=0.975, dh=3.76)
- [ ] Results tables: regenerate Tables IV-XVIII on Big-4 subset; new §IV-K full-dataset secondary
- [ ] Prose rewrite: Abstract / Intro / Discussion / Conclusion with Firm A reframed as "templated end of Big-4" case study (was: hand-signed calibration anchor)
- [ ] AI peer review: ≥3 cross-AI rounds (codex, Gemini 3.x Pro, Opus 4.7) on the v4.0 manuscript
- [ ] Partner Jimmy second review on v4.0 (he proposed this direction; needs sign-off on execution)
- [ ] iThenticate <20%, eCF copyright form, IEEE Access submission portal upload + cover letter
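
The `--scope=big4` flag in the Foundation item could be wired into each core script roughly as follows. This is a sketch under stated assumptions: the `full` default, the `firm` field name, and the English firm labels are hypothetical and may not match what the scripts or the DB actually use.

```python
import argparse

# Hypothetical firm labels; the values stored in the real DB may differ.
BIG4 = {"Deloitte", "KPMG", "PwC", "EY"}


def build_parser():
    """Parser exposing a --scope option; 'full' as the default is an assumption."""
    parser = argparse.ArgumentParser(description="Scope-aware analysis script (sketch)")
    parser.add_argument(
        "--scope",
        choices=["full", "big4"],
        default="full",
        help="restrict the analysis to Big-4 signatures or use the full dataset",
    )
    return parser


def filter_by_scope(rows, scope):
    """Return only Big-4 rows when scope == 'big4', otherwise all rows."""
    if scope == "big4":
        return [row for row in rows if row["firm"] in BIG4]
    return list(rows)


# Usage with an explicit argument list, so the sketch runs without a real CLI call.
args = build_parser().parse_args(["--scope=big4"])
rows = [{"firm": "KPMG"}, {"firm": "SomeMidFirm"}, {"firm": "EY"}]
subset = filter_by_scope(rows, args.scope)
print(args.scope, len(subset))
```

Keeping the scope filter in one shared helper would let Scripts 19, 20, 21, 24, and 25 produce the Big-4 primary tables and the full-dataset secondary tables from the same code path.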

### Out of Scope

<!-- Explicit boundaries. Includes reasoning to prevent re-adding. -->

- **Paper B (audit behaviour / policy implications)** — partner v4 contribution D, deferred to a separate paper after Paper A ships
- **Paper C standalone (reverse-anchor methodology)** — initial 2026-05-12 spike direction, **folded back into Paper A v4.0 §IV-K** as one robustness lens; does not warrant a separate manuscript
- **Mid/small-firm primary scope** — included as full-dataset secondary only; primary scope is Big-4 because the dip test indicates multimodality only at the Big-4 level
- **Per-document classifier release as software product** — paper-only deliverable; no API / SaaS layer in scope
- **VLM behavioural interview / IRB study** — removed in v3.4; not coming back

## Context

- **Domain**: Taiwan-listed CPA audit signatures, 2013-2023; 4 Big-4 firms (勤業眾信 Deloitte, 安侯建業 KPMG, 資誠 PwC, 安永 EY) + ~30 mid/small firms
- **Hardware split**: YOLO + ResNet on RTX 4090 (CUDA, deterministic forward inference, fixed seed); statistical analysis on Apple Silicon MPS / CPU
- **Domain expert**: User has practitioner-level CPA-firm knowledge in Taiwan; recognises specific senior-partner names (e.g., 薛明玲 / 周建宏 are known PwC seniors that surfaced in Script 35's C1 cluster)
- **Partner**: Collaborating with partner Jimmy, who proposed the Big-4-only direction and is the trigger for v4.0

## Constraints

- **Target journal**: IEEE Access (A/6 on NCKU CSIE list); fits Computer-Vision-applied-to-Audit scope
- **Timeline**: v3.20.0 was already partner-reviewed and DOCX-shipped (2026-05-05). v4.0 reframe will delay submission by ~4-6 weeks but produces a stronger manuscript; partner Jimmy is aware and supportive
- **Reproducibility**: pipeline must run end-to-end on the existing `/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db` snapshot; no new data ingest in scope
- **AI review provenance**: every empirical claim must be backed by a fresh sqlite/grep against the named script — see `[[feedback-provenance-fabrication]]` memory; Gemini round-19 caught 4 fabricated provenance claims previously
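
The provenance constraint can be enforced mechanically by re-running the query behind each quoted number. The sketch below shows one possible shape for such a check using only the standard-library `sqlite3` module; the `verify_claim` helper, the `signatures` table, and its `firm` column are hypothetical and stand in for the real schema, which is not described here.

```python
import sqlite3


def verify_claim(conn, claim, sql, expected, params=()):
    """Run a scalar query and compare it with the number quoted in the paper."""
    actual = conn.execute(sql, params).fetchone()[0]
    return {
        "claim": claim,
        "sql": sql,
        "expected": expected,
        "actual": actual,
        "ok": actual == expected,
    }


# Toy in-memory database with a hypothetical schema, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signatures (id INTEGER PRIMARY KEY, firm TEXT)")
conn.executemany(
    "INSERT INTO signatures (firm) VALUES (?)",
    [("KPMG",), ("PwC",), ("SomeMidFirm",)],
)

result = verify_claim(
    conn,
    claim="toy example: number of KPMG signatures",
    sql="SELECT COUNT(*) FROM signatures WHERE firm = ?",
    expected=1,
    params=("KPMG",),
)
print(result["ok"], result["actual"])
```

Against the real snapshot, opening the database read-only, for example with `sqlite3.connect("file:/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db?mode=ro", uri=True)`, guards against accidental writes during verification.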

## Key Decisions

| Decision | Rationale | Outcome |
|----------|-----------|---------|
| Use ResNet-50 ImageNet1K_V2 without fine-tune | Reproducibility; avoid label leakage from fine-tuning on the same corpus | ✓ Validated through v3.x |
| Cosine + independent dHash dual descriptor | Cosine catches semantic similarity; independent dHash catches near-pixel-identical replication | ✓ Validated |
| Drop SSIM / pixel-pHash from descriptor set | Reviewer-rejected as redundant / fragile | ✓ v3.x rewrite |
| Drop A2 within-year uniformity assumption | Empirically falsified by Script 27 | ✓ v3.14 |
| **Reframe scope to Big-4 only as primary** | Dip-test multimodal only at Big-4 level (p<0.0001); mid/small noise distorted Paper A v3.x's published 0.945/8.10 threshold; partner Jimmy's earlier suggestion empirically confirmed by Scripts 32-35 | — Pending v4.0 |
| Reverse-anchor Paper C → folded into v4.0 §IV-K | Big-4 reframe is the stronger story; reverse-anchor is one of several lenses on the same data, not a standalone paper | ✓ Decided 2026-05-12 |
| Branch strategy: `paper-a-v4-big4` from `from-outside-of-firmA` from `yolo-signature-pipeline` | Spike artifacts (Scripts 32-35) stay on the spike branch; v4.0 paper work isolated on its own sub-branch; v3.20.0 preserved on yolo-signature-pipeline as fallback | ✓ Decided 2026-05-12 |

---
*Last updated: 2026-05-12 after Paper A v4.0 Big-4 reframe milestone bootstrap*