Structural:
- Promote operational classifier definition from §III-L.0 to new §III-H.1, so the reader meets the five-way HC/MC/HSC/UN/LH rule before the §III-I/J/K diagnostic chain instead of ~130 lines after. §III-L renamed to "Anchor-Based Threshold Calibration"; §III-L.0 retains only calibration methodology, three units of analysis, any-pair semantics, and the FAR terminological note. §III-L.7 deleted (redundant with §III-J).
- Reorganise §V-H Limitations into Primary / Secondary / Documented features / Engineering groupings (was a flat 14-item list).
- Reframe §III-M from "ten-tool unsupervised-validation collection" to "each diagnostic addresses one specific unsupervised failure mode"; rename "What v4.0 does/does not claim" → "Limits / Scope of the present analysis"; retitle Table XXVII.

Framing alignment (cross-section):
- Strip all v3.x / v4.0 / v3.20 / v4-new / inherited lineage labels from rendered text (Abstract, Intro, §II, §III, §IV, §V, §VI, Appendix, Impact).
- Replace "Paper A" rule references with "deployed" rule references.
- Soften "validation" to "characterise" / "check" / "screening label" / "consistency check" / "support"; "verdict" → "screening label".
- Remove codex-verified spike claims (non-Big-4 jittered dHash, Big-4 pooled cosine after firm-mean centring). Only formally scripted evidence (Scripts 39b–39e) retained; non-Big-4 evidence framed as corroborating raw-axis cosine, not as calibration evidence.
- Strip script-provenance parentheticals from Introduction; defer Script 39c internal references and similar to Methodology / Appendix.

Numerical / table fixes:
- §III-C document-count arithmetic: 12 corrupted → 13 corrupted/unreadable, verified against sqlite DB and total-pdf/ folder counts (90,282 - 4,198 no-sig - 13 corrupted = 86,071 → 85,042 with detections → 182,328 sigs → 168,755 CPA-matched). Table I shows VLM-positive (86,084) and processed-for-extraction (86,071) as separate rows.
- Wilson 95% CIs added for joint-rule ICCR rows in Table XXI / methodology table ([0.00011, 0.00018] and [0.00008, 0.00014]).
- Unit error fixed: 0.3856 pp / 0.4431 pp → 0.3856 (38.6 pp) / 0.4431 (44.3 pp).

Smaller revisions:
- Pipeline framing: "detecting" → "screening" in Abstract / Intro / Conclusion for consistency with the unsupervised-screening positioning.
- "hard ground-truth subset" → "conservative hard-positive subset" throughout.
- §III-F SSIM / pixel-comparison rebuttal compressed from ~15 lines to 4; design-level argument deferred to supplementary materials.
- "stakeholders can adopt / can derive thresholds" → "alternative operating points can be characterised by inverting" (less prescriptive).
- "the same mechanism extending in milder form to Firms B/C/D" → "similar, milder production-related reuse patterns at Firms B/C/D" (mechanism claim softened).
- Appendix A "non-hand-signed mode" / "two-mechanism mixture" lineage language aligned with v4 framing.

Appendix B:
- Rebuilt as a redirect-only stub. The HTML-commented obsolete table mapping (Table IX–XVIII labels with FAR / capture-rate / validation language) is removed; replaced with a short paragraph pointing to supplementary materials for full table-to-script provenance.

Cross-references:
- All §III-L references for the rule definition retargeted to §III-H.1; references for calibration still point to §III-L.
- §III-H references for byte-level Firm A evidence / non-Big-4 reverse anchor retargeted to §III-H.2.

Artefacts:
- Combined manuscript regenerated: paper_a_v4_combined.md, 1314 lines (was 1346 pre-review).
- Two review handoff documents added: paper/review_handoff_abstract_intro_20260515.md, paper/review_handoff_body_20260515.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review Handoff: Abstract and Introduction
Date: 2026-05-15
Target manuscript: paper/paper_a_v4_combined.md
Scope reviewed: Abstract and Introduction only
Overall Assessment
The Abstract and Introduction are substantively strong and defensible. The current argument is clear:
- Regulations require CPA attestation, but digitized PDF workflows make stored-signature reuse operationally easy.
- The problem is not signature forgery; identity is not in dispute. The target is detecting possible image-level reproduction by the legitimate signer or firm workflow.
- The paper avoids claiming validated forensic detection and instead frames the system as an anchor-calibrated screening framework under unsupervised constraints.
- The strongest methodological move is replacing unsupported distributional "natural threshold" logic with anchor-based inter-CPA coincidence-rate (ICCR) calibration.
Recommended disposition: Minor Revision for prose and narrative complexity, not for core empirical weakness.
Main Reviewer Concern
The Introduction currently explains the methodology shift too explicitly as a research-process or version-history pivot. This is useful internally, but in the submitted paper it may increase complexity and invite reviewers to focus on why earlier versions used a different framing.
The final manuscript should explain the final methodological choice, not the internal research journey.
Keep:
- The descriptor distribution does not support a stable within-population bimodal antimode.
- Apparent multimodality is explained by firm composition and integer mass-point artefacts.
- Mixture fits are descriptive, not threshold-generating.
- Operational rules are characterized using anchor-based ICCR at multiple units.
Reduce or remove:
- "Earlier work in this lineage..."
- "v4.0 contribution..."
- "overturns this reading..."
- "inherited Paper A v3.x..."
- Internal script-heavy provenance in the Introduction.
Detailed provenance belongs in Methodology, Results, Appendix, or reproducibility notes, not in the opening narrative.
Suggested Rewrite Direction for Introduction Pivot Paragraph
Current issue location: around paper/paper_a_v4_combined.md, Introduction paragraph beginning with "The methodological reframing relative to earlier versions..."
Recommended replacement direction:
A key empirical finding is that the descriptor distributions do not support a within-population natural threshold. The apparent multimodality in the Big-4 accountant-level distribution is explained by between-firm location shifts and integer mass-point artefacts on the dHash axis. After firm-mean centring and integer-tie jitter, the pooled dHash dip-test rejection disappears. Within-firm diagnostics likewise do not reveal a stable bimodal antimode. We therefore treat mixture fits as descriptive summaries of firm-compositional structure rather than threshold-generating mechanisms, and calibrate the deployed operating rules using inter-CPA coincidence-rate anchors.
This preserves the methodological defense while removing the internal v3-to-v4 story.
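For concreteness (not as manuscript text), the centring-and-jitter diagnostic that paragraph summarizes can be expressed as a short sketch. This is a minimal illustration, assuming a pandas DataFrame with hypothetical "firm" and "dhash" columns and the third-party `diptest` package; the manuscript's scripts remain the authoritative implementation.

```python
import numpy as np
import pandas as pd
import diptest  # third-party Hartigan dip-test package (assumed dependency)

def centred_jittered_dip_pvalue(df: pd.DataFrame, seed: int = 0) -> float:
    """Dip-test p-value for pooled dHash distances after firm-mean
    centring and uniform jitter to break integer mass points."""
    rng = np.random.default_rng(seed)
    # Remove between-firm location shifts before pooling.
    centred = df["dhash"] - df.groupby("firm")["dhash"].transform("mean")
    # Break integer ties with uniform jitter on the dHash axis.
    jittered = centred.to_numpy() + rng.uniform(-0.5, 0.5, size=len(centred))
    _, pvalue = diptest.diptest(jittered)  # returns (dip statistic, p-value)
    return pvalue
```

A non-significant p-value after centring and jitter is what supports the claim that the pooled dHash rejection is a composition artefact rather than a within-population antimode.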
Abstract-Specific Comments
The Abstract is strong but very dense. It is currently optimized for technical reviewers rather than broad readability. That may be acceptable for IEEE Access, but the first sentence has a small grammar/style issue.
Suggested edit:
Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports -- through administrative stamping or firm-level electronic signing -- thereby undermining individualized attestation.
Reason:
- Current wording: "digitization makes reusing ... undermining ..." is grammatically awkward.
- The suggested version makes the causal relation explicit.
No need to remove the final limitation sentence. The sentence "not as a validated forensic detector; no calibrated error rates..." is important and should remain.
Introduction-Specific Comments
1. Keep the legal framing but avoid legal overclaiming
The sentence saying non-hand-signed workflows "may fall within the literal statutory requirement" is acceptable because it is cautious. Do not strengthen it into a legal conclusion.
Preferred style:
- "may fall within"
- "raises substantive concerns"
- "may not represent meaningful individual attestation"
Avoid:
- "violates"
- "illegal"
- "non-compliant"
- "fraudulent"
2. Preserve the forgery distinction
The distinction between non-hand-signing detection and signature forgery detection is one of the strongest conceptual contributions. Keep it prominent.
Key idea to preserve:
- Forgery detection asks whether the signer is genuine.
- This paper asks whether the signing act was repeated for each document or a stored image was reused.
3. Reduce script/provenance detail in the Introduction
Current paragraph references scripts such as Script 39c and Script 39d. This makes the Introduction read like an internal review memo.
Recommendation:
- Remove or simplify script references from Introduction.
- Keep exact script provenance in Methodology, Results, Appendix B, or supplementary material.
Specific risk:
- The current parenthetical "10 firms tested in Script 39c" is imprecise for jittered dHash. Script 39c's raw-dHash tests reject unimodality; the non-Big-4 jittered-dHash no-rejection statement rests on a codex-verified read-only spike on the same substrate.
Safer Introduction wording:
Within-firm diagnostics likewise fail to reveal stable bimodal structure after accounting for integer ties, including in eligible mid/small-firm checks.
If provenance must remain:
Within-firm signature-level cosine checks fail to reject in eligible firms, and corresponding jittered-dHash checks fail to reject in Big-4 firms and in a read-only spike on the same mid/small-firm substrate.
4. Avoid presenting the Introduction as a Results section
The Introduction currently contains many detailed numbers. Some are necessary because the paper is methodological, but the v4 pivot paragraphs are numerically heavy.
Keep headline numbers:
- Dataset size: 90,282 reports, 182,328 signatures, 758 CPAs.
- Big-4 scope: 437 CPAs, 150,442 signatures.
- Key ICCR levels: per-comparison, per-signature, per-document.
- Firm heterogeneity: Firm A 0.62 vs Firms B/C/D 0.09-0.16.
Consider moving or reducing:
- Full script-specific details.
- Too many parenthetical rule semantics in the Introduction.
- Repeated mentions of inherited/v3/v4 framing.
Recommended Minimum Patch List
- Fix Abstract first sentence grammar: "digitization makes it feasible to reuse..."
- Rewrite the Introduction paragraph that begins with "The methodological reframing relative to earlier versions..." so it describes the final methodological rationale rather than v3-to-v4 revision history.
- Remove or narrow Script 39c provenance in the Introduction because the raw vs jittered dHash distinction is subtle and currently risky.
- Replace internal-version language across the Introduction:
- Replace "v4.0 adopts..." with "We adopt..."
- Replace "Earlier work in this lineage..." with "A distributional-threshold approach would be inappropriate here because..."
- Replace "inherited Paper A v3.x five-way box rule" with "the deployed five-way box rule" unless historical provenance is essential.
- Preserve limitation language:
- The paper should continue to say it is not a validated forensic detector.
- The paper should continue to say calibrated error rates cannot be reported without signature-level ground truth.
Reviewer Bottom Line
The paper should not hide that the distributional threshold path failed; that is actually a methodological strength. But it should present this as a final empirical finding and design rationale, not as a visible research-history correction.
Recommended framing:
Because the observed distribution does not provide a defensible natural threshold, we use ICCR calibration to characterize the deployed operating rules under explicit unsupervised assumptions.
This is cleaner, less complex, and more reviewer-facing than the current v3-to-v4 narrative.
Additional Framing Issue: Are We Giving Thresholds or Not?
A likely reviewer confusion point is whether the paper provides a concrete classifier threshold or merely explains why no defensible threshold can be derived.
The intended answer should be explicit:
- The paper does provide a concrete, reproducible operational classifier.
- The paper does not claim that this classifier is ground-truth-optimal.
- The paper does not claim that the operating thresholds are natural antimodes in the descriptor distribution.
- The paper's calibration contribution is to characterize the deployed rule's inter-CPA coincidence behavior under unsupervised assumptions.
Recommended high-level framing:
We use a fixed, pre-specified five-way operating rule. The present calibration does not derive an optimal threshold; instead, it quantifies the rule's inter-CPA coincidence behavior at per-comparison, per-signature, and per-document units under explicit unsupervised assumptions.
Plain-language interpretation:
We have an explicit, reproducible five-way operating rule; the paper does not claim that these thresholds are optimal or natural cut points, but instead uses ICCR to quantify the rule's specificity-proxy behavior in the absence of signature-level ground truth.
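To make the three calibration units concrete, the sketch below shows one way the ICCR computation could look under any-pair semantics, together with the Wilson interval used for the ICCR rows in Table XXI. The (sig_id, cpa_id, doc_id) tuple layout and the `flagged` pair predicate are illustrative assumptions, and the quadratic pair loop is for exposition only; the manuscript's scripts define the authoritative computation.

```python
from itertools import combinations
from statsmodels.stats.proportion import proportion_confint

def iccr_three_units(sigs, flagged):
    """sigs: iterable of (sig_id, cpa_id, doc_id) tuples.
    flagged(a, b) -> bool applies the screening rule to a signature pair."""
    sigs = list(sigs)
    # Inter-CPA comparisons: all pairs of signatures from different CPAs.
    inter = [(a, b) for a, b in combinations(sigs, 2) if a[1] != b[1]]
    hits = [(a, b) for a, b in inter if flagged(a, b)]
    per_comparison = len(hits) / len(inter)
    # Any-pair semantics: a signature (or document) counts once if it
    # appears in at least one flagged inter-CPA pair.
    hit_sig_ids = {s[0] for pair in hits for s in pair}
    per_signature = len(hit_sig_ids) / len(sigs)
    hit_doc_ids = {s[2] for pair in hits for s in pair}
    per_document = len(hit_doc_ids) / len({s[2] for s in sigs})
    return per_comparison, per_signature, per_document

def wilson_ci(hits: int, total: int, alpha: float = 0.05):
    """Wilson interval for reporting an ICCR proportion with a 95% CI."""
    return proportion_confint(hits, total, alpha=alpha, method="wilson")
```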
Concrete Threshold Language to Make Visible
The manuscript should not bury the actual operating thresholds. Somewhere early in Methodology, and preferably summarized in Introduction, make the rule explicit:
High-confidence non-hand-signed: cosine > 0.95 AND dHash <= 5.
Moderate-confidence non-hand-signed: cosine > 0.95 AND 5 < dHash <= 15.
Other outcomes follow the fixed five-way box rule.
If space allows, add a compact sentence:
Thus, the system has explicit decision rules; what remains uncalibrated in the absence of signature-level labels is their true false-positive and false-negative error rate.
This directly answers the reviewer question: "Do the authors actually have a classifier?"
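If a compact illustration helps, the two fully specified boxes translate directly into code. This is a minimal sketch, not manuscript text: the remaining screening labels (HSC / UN / LH) follow box definitions given in the manuscript's Methodology and are deliberately not restated here.

```python
def screening_label(cosine: float, dhash: int) -> str:
    """Apply the two explicitly stated boxes of the five-way rule."""
    if cosine > 0.95 and dhash <= 5:
        return "HC"  # high-confidence non-hand-signed
    if cosine > 0.95 and 5 < dhash <= 15:
        return "MC"  # moderate-confidence non-hand-signed
    return "OTHER"   # HSC / UN / LH per the fixed five-way box rule
```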
Rewrite Style Recommendation
Avoid language that sounds like the authors are unable to provide thresholds:
- Avoid: "No threshold can be derived."
- Avoid: "The distribution does not support classification."
- Avoid: "We cannot determine a threshold."
Use language that distinguishes operational thresholds from statistically natural or supervised-optimal thresholds:
- Prefer: "The deployed thresholds are operational rules rather than natural antimodes."
- Prefer: "We characterize these rules with ICCR rather than claiming supervised error rates."
- Prefer: "The absence of a distributional antimode motivates anchor-based calibration, not threshold-free analysis."
- Prefer: "The system is a concrete screening classifier with explicit unsupervised calibration limits."
Reviewer-Facing Answer to the Threshold Question
If the manuscript needs one sentence that resolves the ambiguity, use:
The system therefore uses explicit operating thresholds, but the evidentiary claim attached to those thresholds is limited: they define a reproducible screening rule whose coincidence behavior can be estimated under inter-CPA anchors, not a validated forensic decision boundary with calibrated error rates.
This should be the guiding style for Abstract, Introduction, and the start of Methodology.
Readability Risk: Too Many Diagnostics Can Look Like Methodological Overbuilding
The manuscript's multi-method statistical design increases rigor, but it also creates a readability risk. In the current form, some sections may feel like a defensive accumulation of diagnostics rather than a clean research design.
Reviewer risk:
- The reader may ask: "Are the authors using many methods because the core classifier is unclear?"
- The reader may miss the simple main claim because the paper introduces too many caveats and validation tools early.
- The paper may look like "we used many methods, therefore credible" instead of "each method answers one necessary question."
Recommended main-thread sentence:
We deploy a fixed five-way screening rule and characterize its unsupervised reliability limits using ICCR, after showing that the descriptor distribution does not support a natural threshold.
Plain-language interpretation:
We have an explicit five-way screening rule; we first show that natural distributional cut points cannot serve as thresholds, then use ICCR to describe the rule's reliability limits on unlabeled data.
All methods and diagnostics should serve this main thread.
Core vs Supporting Diagnostics
Treat the following as core and keep them prominent:
- End-to-end pipeline: VLM -> YOLO -> ResNet -> cosine/dHash.
- Explicit five-way operating rule.
- Composition decomposition showing why the descriptor distribution does not yield a natural threshold.
- ICCR calibration at three units: per-comparison, per-signature, per-document.
- Firm heterogeneity and within-firm collision concentration.
- Ground-truth limitation and no true error-rate claim.
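For the descriptor comparison named in the first core item, a minimal sketch follows. It assumes standard cosine similarity over upstream ResNet embeddings and the conventional difference-hash construction on an 8-bit grid; function names and the grid size are illustrative, not the manuscript's.

```python
import numpy as np
from PIL import Image

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def dhash_bits(img: Image.Image, size: int = 8) -> np.ndarray:
    """Difference hash: compare adjacent pixels on a (size+1) x size grid."""
    g = np.asarray(img.convert("L").resize((size + 1, size)), dtype=np.int16)
    return (g[:, 1:] > g[:, :-1]).flatten()  # 64 bits for size=8

def dhash_distance(img_a: Image.Image, img_b: Image.Image) -> int:
    """Hamming distance between the two images' difference hashes."""
    return int(np.count_nonzero(dhash_bits(img_a) != dhash_bits(img_b)))
```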
Treat the following as supporting diagnostics and avoid letting them dominate the main narrative:
- K=2 / K=3 mixture fits.
- Three-score Spearman convergence.
- Leave-one-firm-out reproducibility.
- BD/McCrary sensitivity.
- Ten-tool validation table.
- Pixel-identity positive anchor, especially because it is close to tautological for the high-confidence rule.
These supporting diagnostics can stay, but they should be framed as robustness checks, assumption checks, or supplementary evidence, not as independent central contributions.
Suggested Manuscript Structure for Clarity
Recommended structure for the Methodology / Results narrative:
- Core Method
Describe the pipeline, descriptor construction, and five-way rule.
- Why the Threshold Is Operational Rather Than Natural
Use the composition decomposition only. Avoid over-explaining K=3, BD/McCrary, or historical mixture logic here.
- How the Rule Is Calibrated Without Ground Truth
Explain ICCR and the three reporting units: per-comparison, per-signature, per-document.
- What the Calibration Reveals
Report firm heterogeneity and within-firm collision concentration.
- Supporting Diagnostics
Place K=3, Spearman convergence, LOOO, BD/McCrary, and pixel-identity checks here as supporting evidence.
Rewrite Style for Multi-Method Sections
Avoid:
We apply a multi-tool validation framework consisting of ten diagnostics...
This can sound like methodological stacking.
Prefer:
Each supporting diagnostic addresses a specific failure mode: composition artefacts, inter-CPA coincidence, pool-size effects, firm heterogeneity, or positive-anchor capture.
Avoid:
The conjunction of ten tools constitutes validation...
Prefer:
Together, these diagnostics define the limits of what can be supported without signature-level ground truth.
Avoid presenting auxiliary diagnostics before the reader understands the classifier.
Preferred order:
Rule first. Then why not natural threshold. Then ICCR calibration. Then robustness.
Reviewer-Facing Principle
The paper should not read as:
We used many methods, so the result is credible.
It should read as:
We use one explicit screening rule. Each statistical diagnostic answers one necessary question about how that rule should be interpreted under unsupervised constraints.
This distinction is important for readability and reviewer trust.