Files
pdf_signature_extraction/paper/paper_a_appendix_v3.md
T
gbanyan b6913d2f93 Phase 6 round-2 reviewer revisions: §III-H.1 promotion + framing alignment
Structural:
- Promote operational classifier definition from §III-L.0 to new §III-H.1, so
  the reader meets the five-way HC/MC/HSC/UN/LH rule before the §III-I/J/K
  diagnostic chain instead of ~130 lines after. §III-L renamed to
  "Anchor-Based Threshold Calibration"; §III-L.0 retains only calibration
  methodology, three units of analysis, any-pair semantics, and the FAR
  terminological note. §III-L.7 deleted (redundant with §III-J).
- Reorganise §V-H Limitations into Primary / Secondary / Documented features /
  Engineering groupings (was a flat 14-item list).
- Reframe §III-M from "ten-tool unsupervised-validation collection" to
  "each diagnostic addresses one specific unsupervised failure mode";
  rename "What v4.0 does/does not claim" → "Limits / Scope of the present
  analysis"; retitle Table XXVII.

Framing alignment (cross-section):
- Strip all v3.x / v4.0 / v3.20 / v4-new / inherited lineage labels from
  rendered text (Abstract, Intro, §II, §III, §IV, §V, §VI, Appendix, Impact).
- Replace "Paper A" rule references with "deployed" rule references.
- Soften "validation" to "characterise" / "check" / "screening label" /
  "consistency check" / "support"; "verdict" → "screening label".
- Remove codex-verified spike claims (non-Big-4 jittered dHash, Big-4 pooled
  cosine after firm-mean centring). Only formally scripted evidence
  (Scripts 39b–39e) retained; non-Big-4 evidence framed as corroborating
  raw-axis cosine, not as calibration evidence.
- Strip script-provenance parentheticals from Introduction; defer Script 39c
  internal references and similar to Methodology / Appendix.

Numerical / table fixes:
- §III-C document-count arithmetic: 12 corrupted → 13 corrupted/unreadable,
  verified against sqlite DB and total-pdf/ folder counts (90,282 - 4,198
  no-sig - 13 corrupted = 86,071 → 85,042 with detections → 182,328 sigs →
  168,755 CPA-matched). Table I shows VLM-positive (86,084) and
  processed-for-extraction (86,071) as separate rows.
- Wilson 95% CIs added for joint-rule ICCR rows in Table XXI / methodology
  table ([0.00011, 0.00018] and [0.00008, 0.00014]).
- Unit error fixed: 0.3856 pp / 0.4431 pp → 0.3856 (38.6 pp) / 0.4431 (44.3 pp).

Smaller revisions:
- Pipeline framing: "detecting" → "screening" in Abstract / Intro / Conclusion
  for consistency with the unsupervised-screening positioning.
- "hard ground-truth subset" → "conservative hard-positive subset" throughout.
- §III-F SSIM / pixel-comparison rebuttal compressed from ~15 lines to 4;
  design-level argument deferred to supplementary materials.
- "stakeholders can adopt / can derive thresholds" → "alternative operating
  points can be characterised by inverting" (less prescriptive).
- "the same mechanism extending in milder form to Firms B/C/D" → "similar,
  milder production-related reuse patterns at Firms B/C/D" (mechanism claim
  softened).
- Appendix A "non-hand-signed mode" / "two-mechanism mixture" lineage language
  aligned with v4 framing.

Appendix B:
- Rebuilt as a redirect-only stub. The HTML-commented obsolete table mapping
  (Table IX–XVIII labels with FAR / capture-rate / validation language) is
  removed; replaced with a short paragraph pointing to supplementary
  materials for full table-to-script provenance.

Cross-references:
- All §III-L references for the rule definition retargeted to §III-H.1;
  references for calibration still point to §III-L.
- §III-H references for byte-level Firm A evidence / non-Big-4 reverse anchor
  retargeted to §III-H.2.

Artefacts:
- Combined manuscript regenerated: paper_a_v4_combined.md, 1314 lines
  (was 1346 pre-review).
- Two review handoff documents added:
  paper/review_handoff_abstract_intro_20260515.md
  paper/review_handoff_body_20260515.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 18:07:31 +08:00

40 lines
3.9 KiB
Markdown

# Appendix A. BD/McCrary Bin-Width Sensitivity (Signature Level)
The main text (Section III-I, Section IV-D.2) treats the Burgstahler-Dichev / McCrary discontinuity procedure [38], [39] as a *density-smoothness diagnostic* rather than as a threshold estimator.
This appendix documents the empirical basis for that framing by sweeping the bin width across four (variant, bin-width) panels: Firm A and full-sample, each in the cosine and $\text{dHash}_\text{indep}$ direction.
<!-- TABLE A.I: BD/McCrary Bin-Width Sensitivity (two-sided alpha = 0.05, |Z| > 1.96)
| Variant | n | Bin width | Best transition | z_below | z_above |
|---------|---|-----------|-----------------|---------|---------|
| Firm A cosine (sig-level) | 60,448 | 0.003 | 0.9870 | -2.81 | +9.42 |
| Firm A cosine (sig-level) | 60,448 | 0.005 | 0.9850 | -9.57 | +19.07 |
| Firm A cosine (sig-level) | 60,448 | 0.010 | 0.9800 | -54.64 | +69.96 |
| Firm A cosine (sig-level) | 60,448 | 0.015 | 0.9750 | -85.86 | +106.17 |
| Firm A dHash_indep (sig-level) | 60,448 | 1 | 2.0 | -4.69 | +10.01 |
| Firm A dHash_indep (sig-level) | 60,448 | 2 | no transition | — | — |
| Firm A dHash_indep (sig-level) | 60,448 | 3 | no transition | — | — |
| Full-sample cosine (sig-level) | 168,740 | 0.003 | 0.9870 | -3.21 | +8.17 |
| Full-sample cosine (sig-level) | 168,740 | 0.005 | 0.9850 | -8.80 | +14.32 |
| Full-sample cosine (sig-level) | 168,740 | 0.010 | 0.9800 | -29.69 | +44.91 |
| Full-sample cosine (sig-level) | 168,740 | 0.015 | 0.9450 | -11.35 | +14.85 |
| Full-sample dHash_indep (sig-l.) | 168,740 | 1 | 2.0 | -6.22 | +4.89 |
| Full-sample dHash_indep (sig-l.) | 168,740 | 2 | 10.0 | -7.35 | +3.83 |
| Full-sample dHash_indep (sig-l.) | 168,740 | 3 | 9.0 | -11.05 | +45.39 |
-->
Two patterns are visible in Table A.I.
First, the procedure consistently identifies a "transition" under every bin width, but the *location* of that transition drifts monotonically with bin width (Firm A cosine: 0.987 → 0.985 → 0.980 → 0.975 as bin width grows from 0.003 to 0.015; full-sample dHash: 2 → 10 → 9 as the bin width grows from 1 to 3).
The $Z$ statistics also inflate superlinearly with the bin width (Firm A cosine $|Z|$ rises from $\sim 9$ at bin 0.003 to $\sim 106$ at bin 0.015) because wider bins aggregate more mass per bin and therefore shrink the per-bin standard error on a very large sample.
Both features are characteristic of a histogram-resolution artifact rather than of a genuine density discontinuity.
Second, the candidate transitions all locate *inside* the high-similarity region (cosine $\geq 0.975$, dHash $\leq 10$) rather than at a between-mode boundary, which is the location pattern we would expect of a clean within-population antimode.
Taken together, Table A.I shows that the signature-level BD/McCrary transitions are not a threshold in the usual sense---they are histogram-resolution-dependent local density anomalies located *inside* the non-hand-signed mode rather than between modes.
This observation supports the main-text decision to use BD/McCrary as a density-smoothness diagnostic rather than as a threshold estimator and reinforces the joint reading of Section IV-D that the descriptor distributions do not contain a within-population bimodal antimode that could anchor an operational threshold.
Raw per-bin $Z$ sequences and $p$-values for every (variant, bin-width) panel are available in the supplementary materials.
# Appendix B. Reproducibility Materials
The full table-to-script provenance mapping, script source code, and report artefacts for every numerical table and figure in this paper are provided in the supplementary materials. Scripts run deterministically under fixed random seeds documented there; reviewer reproduction should re-emit artefacts from the listed scripts rather than rely on any local path layout.