gbanyan e429e4eed1 Bootstrap .planning/ for Paper A v4.0 milestone

Hand-written minimal GSD scaffolding (PROJECT.md / REQUIREMENTS.md /
ROADMAP.md / STATE.md) without running /gsd-ingest-docs because:

  * 51 pre-existing markdown files exceed the v1 50-doc cap and most
    are stale (older review rounds, infrastructure notes) or already
    captured in auto-memory project_signature_research.md
  * Heavyweight ingest workflow not needed when project context is
    already comprehensive

PROJECT.md captures the Big-4 reframe key decision and the locked
v3.x history; REQUIREMENTS.md defines REQ-001..008 for v4.0;
ROADMAP.md lays out 7 phases (Foundation -> Methodology -> Results
-> Prose -> AI peer review -> Partner re-review -> Submission);
STATE.md anchors at Phase 1 entry on branch paper-a-v4-big4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-12 14:43:34 +08:00

6.1 KiB


Taiwan TWSE CPA Signature Authentication

What This Is

A computer-vision research pipeline that classifies whether the CPA signatures appearing on Taiwan TWSE-listed-company financial reports are hand-signed (親簽) or non-hand-signed (非親簽 — early-period rubber-stamp / scan, or post-2020 firm-level electronic signature systems). The pipeline ingests ~90k PDFs (2013-2023), detects ~182k signatures with YOLOv11n, embeds them with ResNet-50 (ImageNet1K_V2, no fine-tune), and characterises distributional structure with cosine + independent dHash descriptors. Target: a peer-reviewed publication (IEEE Access, A/6 on the NCKU CSIE journal list).
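The two descriptors named above can be illustrated with a minimal, dependency-free sketch. It assumes the embedding is already available as a plain list of floats and that each signature crop has already been resized to a grayscale grid of 8 rows by 9 columns; the function names are hypothetical and are not taken from the project's scripts.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def dhash_bits(gray):
    """Difference hash of a grayscale grid with 8 rows and 9 columns.

    Each bit records whether a pixel is brighter than its right-hand
    neighbour, which yields 8 * 8 = 64 bits. Resizing the real crop to
    this grid is assumed to happen upstream.
    """
    if len(gray) != 8 or any(len(row) != 9 for row in gray):
        raise ValueError("expected a grid of 8 rows by 9 columns")
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


# Toy grids: brightness rising left to right, and its mirror image.
ramp = [[col * 10 for col in range(9)] for _ in range(8)]
mirrored = [list(reversed(row)) for row in ramp]
```

On these toy inputs an identical pair gives a Hamming distance of 0 and the mirrored pair gives 64, so a small dHash distance together with a high cosine similarity is the pattern one would expect from replicated signatures.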

Core Value

A statistically defensible, reproducible thresholding methodology that distinguishes hand-signed from digitally-replicated CPA signatures at the population level, with traceable evidence at every step (DB → script → table → paper claim).

Requirements

Validated

✓ End-to-end pipeline (TWSE MOPS scrape → Qwen2.5-VL prefilter → YOLO detection → ResNet embedding → DB + descriptors) — signature_analysis/01-19
✓ Independent dHash descriptor for replication detection — Script 14 (v3.x baseline)
✓ Accountant-level 3-component GMM characterisation — Script 18/20 (v3.x baseline)
✓ Paper A v3.20.0 manuscript (full-dataset framing, partner Jimmy 2026-04-27 substantive review accepted, codex 3-pass verification clean) — commit 53125d1 on yolo-signature-pipeline
✓ Spike scripts 32-35 confirming Big-4-only scope is methodologically superior — commits e1d81e3, 8ac0988, 55f9f94 on paper-a-v4-big4

Active

Milestone: Paper A v4.0 — Big-4 reframe (primary scope) + full-dataset robustness (secondary)

Foundation: rerun core scripts on Big-4 subset with --scope=big4 flag (/scripts 19, 20, 21, 24, 25)
Methodology rewrite: §III-G/I/J/L re-anchored on dip-test confirmed bimodality and bootstrap-stable Big-4 K=2 GMM (cos=0.975, dh=3.76)
Results tables: regenerate Tables IV-XVIII on Big-4 subset; new §IV-K full-dataset secondary
Prose rewrite: Abstract / Intro / Discussion / Conclusion with Firm A reframed as "templated end of Big-4" case study (was: hand-signed calibration anchor)
AI peer review: ≥3 cross-AI rounds (codex, Gemini 3.x Pro, Opus 4.7) on the v4.0 manuscript
Partner Jimmy second review on v4.0 (he proposed this direction; needs sign-off on execution)
iThenticate <20%, eCF copyright form, IEEE Access submission portal upload + cover letter
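The K=2 GMM step in the methodology item above can be sketched as plain expectation-maximisation. This is a one-dimensional simplification on invented numbers, not the code in Script 18/20; `fit_gmm_k2`, the initialisation, and the variance floor are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_k2(xs, iters=200, var_floor=1e-6):
    """Fit a 1-D two-component Gaussian mixture with plain EM.

    Returns (weights, means, variances) with components ordered by mean.
    """
    xs = list(xs)
    n = len(xs)
    grand_mean = sum(xs) / n
    grand_var = max(sum((x - grand_mean) ** 2 for x in xs) / n, var_floor)
    # Assumed initialisation: means at the extremes, shared overall variance.
    means = [min(xs), max(xs)]
    variances = [grand_var, grand_var]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of component 0 for each observation.
        resp0 = []
        for x in xs:
            p0 = weights[0] * normal_pdf(x, means[0], variances[0])
            p1 = weights[1] * normal_pdf(x, means[1], variances[1])
            total = p0 + p1
            resp0.append(p0 / total if total > 0.0 else 0.5)
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in (0, 1):
            resp = resp0 if k == 0 else [1.0 - r for r in resp0]
            mass = sum(resp)
            if mass == 0.0:
                continue
            weights[k] = mass / n
            means[k] = sum(r * x for r, x in zip(resp, xs)) / mass
            spread = sum(r * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / mass
            variances[k] = max(spread, var_floor)
    order = sorted((0, 1), key=lambda k: means[k])
    return ([weights[k] for k in order],
            [means[k] for k in order],
            [variances[k] for k in order])


# Invented similarity scores forming a lower and a higher group.
scores = [0.78, 0.80, 0.82, 0.79, 0.81, 0.97, 0.98, 0.99, 0.975, 0.985]
weights, means, variances = fit_gmm_k2(scores)
```

A bootstrap stability check would repeat this fit on resampled data and compare the recovered component means; the real analysis works on the cosine and dHash descriptors jointly rather than on a single axis.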

Out of Scope

Paper B (audit behaviour / policy implications): partner v4 contribution D, deferred to a separate paper after Paper A ships
Paper C standalone (reverse-anchor methodology): initial 2026-05-12 spike direction, folded back into Paper A v4.0 §IV-K as one robustness lens; does not warrant a separate manuscript
Mid/small-firm primary scope: included as full-dataset secondary only; primary scope is Big-4 because the dip test rejects unimodality only at the Big-4 level
Per-document classifier release as software product: paper-only deliverable; no API / SaaS layer in scope
VLM behavioural interview / IRB study: removed in v3.4 and will not be reinstated

Context

Domain: Taiwan-listed CPA audit signatures, 2013-2023; 4 Big-4 firms (勤業眾信 Deloitte, 安侯建業 KPMG, 資誠 PwC, 安永 EY) + ~30 mid/small firms
Hardware split: YOLO + ResNet on RTX 4090 (CUDA, deterministic forward inference, fixed seed); statistical analysis on Apple Silicon MPS / CPU
Domain expert: User has practitioner-level CPA-firm knowledge in Taiwan; recognises specific senior-partner names (e.g., 薛明玲 / 周建宏 are known PwC seniors that surfaced in Script 35's C1 cluster)
Partner: working with partner Jimmy; Jimmy proposed the Big-4-only direction and is the trigger for v4.0

Constraints

Target journal: IEEE Access (A/6 on NCKU CSIE list); fits Computer-Vision-applied-to-Audit scope
Timeline: v3.20.0 was already partner-reviewed and DOCX-shipped (2026-05-05). v4.0 reframe will delay submission by ~4-6 weeks but produces a stronger manuscript; partner Jimmy is aware and supportive
Reproducibility: pipeline must run end-to-end on the existing /Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db snapshot; no new data ingest in scope
AI review provenance: every empirical claim must be backed by a fresh sqlite/grep against the named script — see [[feedback-provenance-fabrication]] memory; Gemini round-19 caught 4 fabricated provenance claims previously
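The provenance constraint above amounts to re-running a query and comparing the result with the number quoted in the manuscript. Below is a minimal sketch using the standard-library `sqlite3` module against an in-memory stand-in; the `signatures` table, its columns, and the `check_claim` helper are hypothetical and do not reflect the schema of signature_analysis.db.

```python
import sqlite3


def check_claim(conn, sql, claimed, tol=0.0):
    """Re-run a single-value query and compare it with the claimed number."""
    (observed,) = conn.execute(sql).fetchone()
    return {
        "claimed": claimed,
        "observed": observed,
        "ok": abs(observed - claimed) <= tol,
    }


# Hypothetical stand-in data; the real check would open the DB snapshot.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE signatures (id INTEGER PRIMARY KEY, firm TEXT, cosine REAL)"
)
conn.executemany(
    "INSERT INTO signatures (firm, cosine) VALUES (?, ?)",
    [("A", 0.98), ("A", 0.97), ("B", 0.81)],
)

# A claim that matches the data and one that does not.
good = check_claim(conn, "SELECT COUNT(*) FROM signatures WHERE firm = 'A'", claimed=2)
bad = check_claim(conn, "SELECT COUNT(*) FROM signatures", claimed=4)
```

Recording the query text next to each claimed value keeps the DB → script → table → paper claim chain auditable, and a failed check flags a claim that needs to be re-derived before review.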

Key Decisions

Decision	Rationale	Outcome
Use ResNet-50 ImageNet1K_V2 without fine-tune	Reproducibility; avoid label leakage from fine-tuning on the same corpus	✓ Validated through v3.x
Cosine + independent dHash dual descriptor	Cosine catches semantic similarity; independent dHash catches byte-level replication	✓ Validated
Drop SSIM / pixel-pHash from descriptor set	Reviewer-rejected as redundant / fragile	✓ v3.x rewrite
Drop A2 within-year uniformity assumption	Empirically falsified by Script 27	✓ v3.14
Reframe scope to Big-4 only as primary	Dip-test multimodal only at Big-4 level (p<0.0001); mid/small noise distorted Paper A v3.x's published 0.945/8.10 threshold; partner Jimmy's earlier suggestion empirically confirmed by Scripts 32-35	— Pending v4.0
Reverse-anchor Paper C → folded into v4.0 §IV-K	Big-4 reframe is the stronger story; reverse-anchor is one of several lenses on the same data, not a standalone paper	✓ Decided 2026-05-12
Branch strategy: `paper-a-v4-big4` from `from-outside-of-firmA` from `yolo-signature-pipeline`	Spike artifacts (Scripts 32-35) stay on the spike branch; v4.0 paper work isolated on its own sub-branch; v3.20.0 preserved on yolo-signature-pipeline as fallback	✓ Decided 2026-05-12

Last updated: 2026-05-12 after Paper A v4.0 Big-4 reframe milestone bootstrap
