
IV. Experiments and Results

A. Experimental Setup

All experiments were conducted on an Apple Silicon workstation with Metal Performance Shaders (MPS) GPU acceleration. Feature extraction used PyTorch 2.9 with torchvision model implementations. The complete pipeline---from raw PDF processing through final classification---was implemented in Python.

B. Signature Detection Performance

The YOLOv11n model achieved high detection performance on the validation set (Table II), with all loss components converging by epoch 60 and no significant overfitting despite the relatively small training set (425 images). We note that Table II reports validation-set metrics, as no separate hold-out test set was reserved given the small annotation budget (500 images total). However, the subsequent production deployment provides practical validation: batch inference on 86,071 documents yielded 182,328 extracted signatures (Table III), an average of 2.12 signatures per document, consistent with the standard practice of two certifying CPAs per audit report. The high VLM--YOLO agreement rate (98.8%) further corroborates detection reliability at scale.

C. Distribution Analysis

Fig. 2 presents the cosine similarity distributions for intra-class (same CPA) and inter-class (different CPAs) pairs. Table IV summarizes the distributional statistics.

Both distributions are left-skewed and leptokurtic. Shapiro--Wilk and Kolmogorov--Smirnov tests rejected normality for both (p < 0.001), confirming that parametric thresholds based on normality assumptions would be inappropriate. Distribution fitting identified the lognormal distribution as the best parametric fit (lowest AIC) for both classes, though we use this result only descriptively; all subsequent thresholds are derived nonparametrically via kernel density estimation (KDE) to avoid distributional assumptions.
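This test-and-fit procedure can be sketched on synthetic stand-in data (a left-skewed, bounded sample loosely mimicking the shape described above; the paper's tests run on the real pairwise cosine scores):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic intra-class similarity stand-in: reflect a lognormal so
# the sample is left-skewed and bounded above, as in the text.
sims = 1.0 - rng.lognormal(mean=-2.5, sigma=0.6, size=2000)
sims = sims[(sims > 0.0) & (sims < 1.0)]

# Normality tests; for strongly skewed data at this sample size both reject.
_, sw_p = stats.shapiro(sims)
_, ks_p = stats.kstest(sims, "norm", args=(sims.mean(), sims.std()))

# Descriptive parametric fit comparison via AIC (lower is better).
def aic(dist, data, params):
    return 2 * len(params) - 2 * np.sum(dist.logpdf(data, *params))

x = 1.0 - sims  # fit the reflected scores so lognormal support applies
aic_lognorm = aic(stats.lognorm, x, stats.lognorm.fit(x, floc=0))
aic_norm = aic(stats.norm, x, stats.norm.fit(x))
```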

The KDE crossover---where the two density functions intersect---was located at 0.837. Under equal prior probabilities and equal misclassification costs, this crossover approximates the optimal decision boundary between the two classes. We note that this threshold is derived from all-pairs similarity distributions and serves as a reference point for interpreting per-signature best-match scores; because the best-match statistic takes the maximum over all pairwise comparisons for a given CPA, it produces systematically higher values (see Section IV-D).
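Locating such a crossover amounts to finding where the two KDE curves intersect. A minimal sketch on illustrative Gaussian stand-ins (the paper's 0.837 comes from the real distributions):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Illustrative stand-ins for the two score populations.
intra = rng.normal(0.90, 0.05, 20000)   # same-CPA pair similarities
inter = rng.normal(0.70, 0.08, 20000)   # different-CPA pair similarities

kde_intra = gaussian_kde(intra)
kde_inter = gaussian_kde(inter)

# The equal-prior, equal-cost decision boundary is where the two
# densities intersect; search on a fine grid between the class means.
grid = np.linspace(inter.mean(), intra.mean(), 2001)
diff = kde_intra(grid) - kde_inter(grid)
crossover = float(grid[np.argmin(np.abs(diff))])
```

For these stand-in parameters the analytic crossing sits near 0.81; the KDE estimate lands close to it because the bandwidth is small at this sample size.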

Statistical tests confirmed significant separation between the two distributions (Table V).

We emphasize that the pairwise observations are not independent---the same signature participates in multiple pairs---so the nominal sample size greatly overstates the effective information content and renders p-values unreliable as measures of evidence strength. We therefore rely primarily on Cohen's d, an effect-size measure that does not grow with sample count. Cohen's d of 0.669 indicates a medium effect size [29], confirming that the distributional difference is practically meaningful rather than an artifact of the large number of pairs.
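Cohen's d with a pooled standard deviation is a one-liner worth making explicit (the synthetic populations below are illustrative; the paper's real value is 0.669):

```python
import numpy as np

def cohens_d(a, b):
    """Pooled-standard-deviation effect size. Unlike a p-value, d does
    not drift toward 'more significant' as the pair count grows, which
    is why it is preferred for the dependent pairwise data."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return float((a.mean() - b.mean()) / pooled)

# Synthetic check on two well-separated score populations.
rng = np.random.default_rng(2)
d = cohens_d(rng.normal(0.90, 0.05, 50000), rng.normal(0.70, 0.08, 50000))
```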

D. Calibration Group Analysis

Fig. 3 presents the cosine similarity distribution of Firm A (the known-replication reference group) compared to the overall intra-class distribution.

Firm A comprises 180 CPAs contributing 16.0 million intra-firm signature pairs. Its distributional characteristics provide empirical anchors for threshold validation:

Firm A's per-signature best-match cosine similarity (mean = 0.980, std = 0.019) is notably higher and more concentrated than the overall CPA population (mean = 0.961, std = 0.029). Critically, 99.3% of Firm A's signatures exhibit a best-match similarity exceeding 0.90, and the 1st percentile is 0.908---establishing that any threshold set above 0.91 would fail to capture the most dissimilar replicated signatures in the calibration group.
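The per-signature best-match statistic used here can be sketched as follows; the clustered feature vectors below mimic digital replication of one source image (illustrative data, not the paper's):

```python
import numpy as np

def best_match_scores(feats):
    """For each signature's L2-normalised feature vector, return the
    maximum cosine similarity to any OTHER signature of the same CPA."""
    feats = np.asarray(feats, float)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)  # a signature cannot match itself
    return sim.max(axis=1)

rng = np.random.default_rng(3)
source = rng.normal(size=128)
# Near-duplicates of one source vector, mimicking a replicated
# signature image reused with small scan-induced perturbations:
# best-match scores concentrate near 1.0, as observed for Firm A.
feats = source + 0.05 * rng.normal(size=(20, 128))
scores = best_match_scores(feats)
```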

This concentration provides strong empirical validation for the threshold selection: the KDE crossover at 0.837 captures essentially all of Firm A's signatures (>99.9%), while more conservative thresholds (e.g., 0.95) still capture 92.5%. The narrow spread (std = 0.019) further confirms that digital replication produces highly predictable similarity scores, as expected when the same source image is reused across documents with only scan-induced variations.

E. Classification Results

Table VII presents the classification results for 84,386 documents using the dual-method framework with Firm A-calibrated thresholds.

The dual-method classification reveals a nuanced picture within the 71,656 documents exceeding the cosine similarity threshold of 0.95. Rather than treating these uniformly as "likely copies" (as a single-metric approach would), the dHash dimension stratifies them into three distinct populations: 29,529 (41.2%) show converging structural evidence of replication (dHash ≤ 5), 36,994 (51.6%) show partial structural similarity (dHash 6--15) consistent with replication degraded by scan variations, and 5,133 (7.2%) show no structural corroboration (dHash > 15), suggesting high signing consistency rather than digital duplication.
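The stratification logic above reduces to a small decision function. A sketch using the thresholds stated in the text (the label strings are illustrative placeholders, not the paper's exact tier names):

```python
def classify(best_match_cosine, dhash_distance):
    """Dual-method stratification with the Firm A-calibrated thresholds
    from the text: cosine > 0.95, then dHash <= 5 / 6-15 / > 15."""
    if best_match_cosine <= 0.95:
        return "below cosine threshold"
    if dhash_distance <= 5:
        return "replication (high confidence)"       # converging structural evidence
    if dhash_distance <= 15:
        return "replication (moderate confidence)"   # scan-degraded replication
    return "high style consistency"                  # no structural corroboration
```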

Calibration Validation

The Firm A column in Table VII validates the calibration: 96.9% of Firm A's documents are classified as replication (high or moderate confidence), and only 0.6% fall into the "high style consistency" category. This confirms that the dHash thresholds, derived from Firm A's distributional characteristics (median = 5, 95th percentile = 15), correctly capture the known-replication population.

Among non-Firm-A CPAs with cosine > 0.95, only 11.3% exhibit dHash ≤ 5, compared to 58.7% for Firm A---a five-fold difference that demonstrates the discriminative power of the structural verification layer.
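The structural verification layer is a perceptual difference hash. A minimal sketch of the standard dHash construction (8×8 by default, yielding the 64-bit hash whose Hamming distance the thresholds above operate on; assumed here to follow the common formulation, as the paper does not spell out its variant):

```python
import numpy as np
from PIL import Image

def dhash(img, hash_size=8):
    """Difference hash: downscale to (hash_size+1) x hash_size grayscale,
    then record whether each pixel is brighter than its right-hand
    neighbour, yielding hash_size**2 bits."""
    small = img.convert("L").resize((hash_size + 1, hash_size), Image.LANCZOS)
    px = np.asarray(small, dtype=np.int16)
    return (px[:, 1:] > px[:, :-1]).flatten()

def hamming(h1, h2):
    """Number of differing bits between two dHash bit arrays."""
    return int(np.count_nonzero(h1 != h2))
```

Identical images give distance 0; unrelated images land near half the bits (about 32 of 64), which is why small distances such as ≤ 5 constitute strong structural evidence of replication.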

F. Ablation Study: Feature Backbone Comparison

To validate the choice of ResNet-50 as the feature extraction backbone, we conducted an ablation study comparing three pre-trained architectures: ResNet-50 (2048-dim), VGG-16 (4096-dim), and EfficientNet-B0 (1280-dim). All models used ImageNet pre-trained weights without fine-tuning, with identical preprocessing and L2 normalization. Table IX presents the comparison.

EfficientNet-B0 achieves the highest Cohen's d (0.707), indicating the greatest statistical separation between intra-class and inter-class distributions. However, it also exhibits the widest distributional spread (intra std = 0.123 vs. ResNet-50's 0.098), resulting in lower per-sample classification confidence. VGG-16 performs worst on all key metrics despite having the highest feature dimensionality (4096), suggesting that additional dimensions do not contribute discriminative information for this task.

ResNet-50 provides the best overall balance: (1) Cohen's d of 0.669 is competitive with EfficientNet-B0's 0.707; (2) its tighter distributions yield more reliable individual classifications; (3) the highest Firm A all-pairs 1st percentile (0.543) indicates that known-replication signatures are least likely to produce low-similarity outlier pairs under this backbone; and (4) its 2048-dimensional features offer a practical compromise between discriminative capacity and computational/storage efficiency for processing 182K+ signatures.