pdf_signature_extraction/.planning/PROJECT.md
gbanyan e429e4eed1 Bootstrap .planning/ for Paper A v4.0 milestone
Hand-written minimal GSD scaffolding (PROJECT.md / REQUIREMENTS.md /
ROADMAP.md / STATE.md) without running /gsd-ingest-docs because:

  * 51 pre-existing markdown files exceed the v1 50-doc cap and most
    are stale (older review rounds, infrastructure notes) or already
    captured in auto-memory project_signature_research.md
  * Heavyweight ingest workflow not needed when project context is
    already comprehensive

PROJECT.md captures the Big-4 reframe key decision and the locked
v3.x history; REQUIREMENTS.md defines REQ-001..008 for v4.0;
ROADMAP.md lays out 7 phases (Foundation -> Methodology -> Results
-> Prose -> AI peer review -> Partner re-review -> Submission);
STATE.md anchors at Phase 1 entry on branch paper-a-v4-big4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:43:34 +08:00


# Taiwan TWSE CPA Signature Authentication
## What This Is
A computer-vision research pipeline that classifies whether the CPA signatures appearing on Taiwan TWSE-listed-company financial reports are hand-signed (親簽) or non-hand-signed (非親簽 — early-period rubber-stamp / scan, or post-2020 firm-level electronic signature systems). The pipeline ingests ~90k PDFs (2013-2023), detects ~182k signatures with YOLOv11n, embeds them with ResNet-50 (ImageNet1K_V2, no fine-tune), and characterises distributional structure with cosine + independent dHash descriptors. Target: a peer-reviewed publication (IEEE Access, A/6 on the NCKU CSIE journal list).
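As a rough illustration of the two descriptors named above, the following sketch computes a cosine similarity between two embedding vectors and a dHash Hamming distance between two small grayscale grids. It is not the pipeline's code: the function names, the 8x9 input grid, and the "left pixel brighter than right neighbour" bit rule are assumptions chosen to match the common textbook definition of a difference hash.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def dhash_bits(gray, hash_size=8):
    """Difference hash of a grayscale grid (assumed already resized).

    `gray` is a list of `hash_size` rows, each with `hash_size + 1` pixel
    intensities. Each output bit is 1 when a pixel is brighter than its
    right-hand neighbour, giving hash_size * hash_size bits in total.
    """
    if len(gray) != hash_size:
        raise ValueError("expected hash_size rows")
    bits = []
    for row in gray:
        if len(row) != hash_size + 1:
            raise ValueError("each row must have hash_size + 1 pixels")
        for left, right in zip(row, row[1:]):
            bits.append(1 if left > right else 0)
    return bits


def hamming_distance(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(x != y for x, y in zip(a, b))
```

In this sketch a pair of signatures would be summarised by two numbers: a high cosine similarity between their embeddings and a low Hamming distance between their hashes. How the actual scripts resize crops, aggregate pairs, or set thresholds is not specified here.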
## Core Value
A statistically defensible, **reproducible** thresholding methodology that distinguishes hand-signed from digitally-replicated CPA signatures at the population level, with traceable evidence at every step (DB → script → table → paper claim).
## Requirements
### Validated
<!-- Shipped and confirmed valuable. -->
- ✓ End-to-end pipeline (TWSE MOPS scrape → Qwen2.5-VL prefilter → YOLO detection → ResNet embedding → DB + descriptors) — `signature_analysis/01-19`
- ✓ Independent dHash descriptor for replication detection — Script 14 (v3.x baseline)
- ✓ Accountant-level 3-component GMM characterisation — Script 18/20 (v3.x baseline)
- ✓ Paper A v3.20.0 manuscript (full-dataset framing, partner Jimmy 2026-04-27 substantive review accepted, codex 3-pass verification clean) — commit `53125d1` on `yolo-signature-pipeline`
- ✓ Spike scripts 32-35 confirming Big-4-only scope is methodologically superior — commits `e1d81e3`, `8ac0988`, `55f9f94` on `paper-a-v4-big4`
### Active
<!-- Current scope. Building toward these. -->
**Milestone: Paper A v4.0 — Big-4 reframe (primary scope) + full-dataset robustness (secondary)**
- [ ] Foundation: rerun core scripts on Big-4 subset with `--scope=big4` flag (`/scripts 19, 20, 21, 24, 25`)
- [ ] Methodology rewrite: §III-G/I/J/L re-anchored on dip-test confirmed bimodality and bootstrap-stable Big-4 K=2 GMM (cos=0.975, dh=3.76)
- [ ] Results tables: regenerate Tables IV-XVIII on Big-4 subset; new §IV-K full-dataset secondary
- [ ] Prose rewrite: Abstract / Intro / Discussion / Conclusion with Firm A reframed as "templated end of Big-4" case study (was: hand-signed calibration anchor)
- [ ] AI peer review: ≥3 cross-AI rounds (codex, Gemini 3.x Pro, Opus 4.7) on the v4.0 manuscript
- [ ] Partner Jimmy second review on v4.0 (he proposed this direction; needs sign-off on execution)
- [ ] iThenticate <20%, eCF copyright form, IEEE Access submission portal upload + cover letter
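The Foundation item above relies on a `--scope=big4` flag. A minimal sketch of how an analysis script might expose such a flag is shown below. Everything apart from the flag name and the four firm names is an assumption: the alternative `full` choice, the `firm` key on each row, the firm label strings, and the helper names are hypothetical and may not match the real scripts or database.

```python
import argparse

# Firm labels follow the Context section of this document; the exact strings
# stored in the real database are an assumption.
BIG4_FIRMS = frozenset({"Deloitte", "KPMG", "PwC", "EY"})


def build_parser():
    """Build a parser exposing a --scope flag (the 'full' choice is hypothetical)."""
    parser = argparse.ArgumentParser(description="Hypothetical analysis script")
    parser.add_argument(
        "--scope",
        choices=("big4", "full"),
        default="full",
        help="restrict the analysis to Big-4 firms or use the full dataset",
    )
    return parser


def filter_by_scope(rows, scope):
    """Keep only Big-4 rows when scope is 'big4'; otherwise keep every row.

    `rows` is an iterable of dicts with a 'firm' key (hypothetical schema).
    """
    if scope == "big4":
        return [row for row in rows if row["firm"] in BIG4_FIRMS]
    return list(rows)
```

A script would call `build_parser().parse_args()` once at start-up and pass `args.scope` to `filter_by_scope` before any statistics are computed. That way the Big-4 primary analysis and the full-dataset secondary analysis share one code path.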
### Out of Scope
<!-- Explicit boundaries. Includes reasoning to prevent re-adding. -->
- **Paper B (audit behaviour / policy implications)** — partner v4 contribution D, deferred to a separate paper after Paper A ships
- **Paper C standalone (reverse-anchor methodology)** — initial 2026-05-12 spike direction, **folded back into Paper A v4.0 §IV-K** as one robustness lens; does not warrant a separate manuscript
- **Mid/small-firm primary scope** — included as full-dataset secondary only; primary scope is Big-4 because dip-test only achieves multimodality at Big-4 level
- **Per-document classifier release as software product** — paper-only deliverable; no API / SaaS layer in scope
- **VLM behavioural interview / IRB study** — removed in v3.4; not coming back
## Context
- **Domain**: Taiwan-listed CPA audit signatures, 2013-2023; 4 Big-4 firms (勤業眾信 Deloitte, 安侯建業 KPMG, 資誠 PwC, 安永 EY) + ~30 mid/small firms
- **Hardware split**: YOLO + ResNet on RTX 4090 (CUDA, deterministic forward inference, fixed seed); statistical analysis on Apple Silicon MPS / CPU
- **Domain expert**: User has practitioner-level CPA-firm knowledge in Taiwan; recognises specific senior-partner names (e.g., 薛明玲 / 周建宏 are known PwC seniors that surfaced in Script 35's C1 cluster)
- **Partner**: Collaborating with partner Jimmy, who proposed the Big-4-only direction and thereby triggered v4.0
## Constraints
- **Target journal**: IEEE Access (A/6 on NCKU CSIE list); fits Computer-Vision-applied-to-Audit scope
- **Timeline**: v3.20.0 was already partner-reviewed and DOCX-shipped (2026-05-05). v4.0 reframe will delay submission by ~4-6 weeks but produces a stronger manuscript; partner Jimmy is aware and supportive
- **Reproducibility**: pipeline must run end-to-end on the existing `/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db` snapshot; no new data ingest in scope
- **AI review provenance**: every empirical claim must be backed by a fresh sqlite/grep against the named script — see `[[feedback-provenance-fabrication]]` memory; Gemini round-19 caught 4 fabricated provenance claims previously
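The provenance constraint above (every empirical claim backed by a fresh query) could be automated along the following lines. This is only a sketch under stated assumptions: the helper names, the claim record layout, and any table or column names used when calling it are hypothetical, and the real schema of `signature_analysis.db` is not described in this document.

```python
import sqlite3


def scalar(conn, sql, params=()):
    """Run a query expected to return a single value and return that value."""
    row = conn.execute(sql, params).fetchone()
    if row is None:
        raise ValueError("query returned no rows")
    return row[0]


def check_claim(conn, description, sql, expected, params=()):
    """Re-run the query behind a claim and compare it with the stated value.

    Returns a small record that can be logged next to the manuscript claim.
    """
    actual = scalar(conn, sql, params)
    return {
        "claim": description,
        "expected": expected,
        "actual": actual,
        "ok": actual == expected,
    }
```

For a read-only check against an existing snapshot, `sqlite3.connect("file:<path>?mode=ro", uri=True)` opens the database without permitting writes. A failed check (`ok` is `False`) would flag the claim for manual review rather than silently updating the expected value.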
## Key Decisions
| Decision | Rationale | Outcome |
|----------|-----------|---------|
| Use ResNet-50 ImageNet1K_V2 without fine-tune | Reproducibility; avoid label leakage from fine-tuning on the same corpus | ✓ Validated through v3.x |
| Cosine + independent dHash dual descriptor | Cosine catches semantic similarity; independent dHash catches pixel-level replication (near-identical image copies) | ✓ Validated |
| Drop SSIM / pixel-pHash from descriptor set | Reviewer-rejected as redundant / fragile | ✓ v3.x rewrite |
| Drop A2 within-year uniformity assumption | Empirically falsified by Script 27 | ✓ v3.14 |
| **Reframe scope to Big-4 only as primary** | Dip-test multimodal only at Big-4 level (p<0.0001); mid/small noise distorted Paper A v3.x's published 0.945/8.10 threshold; partner Jimmy's earlier suggestion empirically confirmed by Scripts 32-35 | — Pending v4.0 |
| Reverse-anchor Paper C → folded into v4.0 §IV-K | Big-4 reframe is the stronger story; reverse-anchor is one of several lenses on the same data, not a standalone paper | ✓ Decided 2026-05-12 |
| Branch strategy: `paper-a-v4-big4` from `from-outside-of-firmA` from `yolo-signature-pipeline` | Spike artifacts (Scripts 32-35) stay on the spike branch; v4.0 paper work isolated on its own sub-branch; v3.20.0 preserved on yolo-signature-pipeline as fallback | ✓ Decided 2026-05-12 |
---
*Last updated: 2026-05-12 after Paper A v4.0 Big-4 reframe milestone bootstrap*