Phase 6 round-2 reviewer revisions: §III-H.1 promotion + framing alignment

Structural:

- Promote the operational classifier definition from §III-L.0 to a new §III-H.1, so the reader meets the five-way HC/MC/HSC/UN/LH rule before the §III-I/J/K diagnostic chain instead of ~130 lines after. §III-L renamed to "Anchor-Based Threshold Calibration"; §III-L.0 retains only calibration methodology, the three units of analysis, any-pair semantics, and the FAR terminological note. §III-L.7 deleted (redundant with §III-J).
- Reorganise §V-H Limitations into Primary / Secondary / Documented features / Engineering groupings (was a flat 14-item list).
- Reframe §III-M from "ten-tool unsupervised-validation collection" to "each diagnostic addresses one specific unsupervised failure mode"; rename "What v4.0 does/does not claim" → "Limits / Scope of the present analysis"; retitle Table XXVII.

Framing alignment (cross-section):

- Strip all v3.x / v4.0 / v3.20 / v4-new / inherited lineage labels from rendered text (Abstract, Intro, §II, §III, §IV, §V, §VI, Appendix, Impact).
- Replace "Paper A" rule references with "deployed" rule references.
- Soften "validation" to "characterise" / "check" / "screening label" / "consistency check" / "support"; "verdict" → "screening label".
- Remove codex-verified spike claims (non-Big-4 jittered dHash, Big-4 pooled cosine after firm-mean centring). Only formally scripted evidence (Scripts 39b–39e) retained; non-Big-4 evidence framed as corroborating raw-axis cosine, not as calibration evidence.
- Strip script-provenance parentheticals from the Introduction; defer Script 39c internal references and similar to Methodology / Appendix.

Numerical / table fixes:

- §III-C document-count arithmetic: 12 corrupted → 13 corrupted/unreadable, verified against the sqlite DB and total-pdf/ folder counts (90,282 - 4,198 no-sig - 13 corrupted = 86,071 → 85,042 with detections → 182,328 sigs → 168,755 CPA-matched). Table I shows VLM-positive (86,084) and processed-for-extraction (86,071) as separate rows.
- Wilson 95% CIs added for joint-rule ICCR rows in Table XXI / methodology table ([0.00011, 0.00018] and [0.00008, 0.00014]).
- Unit error fixed: 0.3856 pp / 0.4431 pp → 0.3856 (38.6 pp) / 0.4431 (44.3 pp).

Smaller revisions:

- Pipeline framing: "detecting" → "screening" in Abstract / Intro / Conclusion for consistency with the unsupervised-screening positioning.
- "hard ground-truth subset" → "conservative hard-positive subset" throughout.
- §III-F SSIM / pixel-comparison rebuttal compressed from ~15 lines to 4; design-level argument deferred to supplementary materials.
- "stakeholders can adopt / can derive thresholds" → "alternative operating points can be characterised by inverting" (less prescriptive).
- "the same mechanism extending in milder form to Firms B/C/D" → "similar, milder production-related reuse patterns at Firms B/C/D" (mechanism claim softened).
- Appendix A "non-hand-signed mode" / "two-mechanism mixture" lineage language aligned with v4 framing.

Appendix B:

- Rebuilt as a redirect-only stub. The HTML-commented obsolete table mapping (Table IX–XVIII labels with FAR / capture-rate / validation language) is removed; replaced with a short paragraph pointing to supplementary materials for full table-to-script provenance.

Cross-references:

- All §III-L references for the rule definition retargeted to §III-H.1; references for calibration still point to §III-L.
- §III-H references for byte-level Firm A evidence / non-Big-4 reverse anchor retargeted to §III-H.2.

Artefacts:

- Combined manuscript regenerated: paper_a_v4_combined.md, 1314 lines (was 1346 pre-review).
- Two review handoff documents added: paper/review_handoff_abstract_intro_20260515.md, paper/review_handoff_body_20260515.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Review Handoff: Abstract and Introduction
Date: 2026-05-15
Target manuscript: `paper/paper_a_v4_combined.md`
Scope reviewed: Abstract and Introduction only
## Overall Assessment

The Abstract and Introduction are substantively strong and defensible. The current argument is clear:

- Regulations require CPA attestation, but digitized PDF workflows make stored-signature reuse operationally easy.
- The problem is not signature forgery; identity is not in dispute. The target is detecting possible image-level reproduction by the legitimate signer or firm workflow.
- The paper avoids claiming validated forensic detection and instead frames the system as an anchor-calibrated screening framework under unsupervised constraints.
- The strongest methodological move is replacing unsupported distributional "natural threshold" logic with anchor-based inter-CPA coincidence-rate (ICCR) calibration.

Recommended disposition: Minor Revision for prose and narrative complexity, not for core empirical weakness.
## Main Reviewer Concern

The Introduction currently explains the methodology shift too explicitly as a research-process or version-history pivot. This is useful internally, but in the submitted paper it may increase complexity and invite reviewers to focus on why earlier versions used a different framing.

The final manuscript should explain the final methodological choice, not the internal research journey.

Keep:

- The descriptor distribution does not support a stable within-population bimodal antimode.
- Apparent multimodality is explained by firm composition and integer mass-point artefacts.
- Mixture fits are descriptive, not threshold-generating.
- Operational rules are characterized using anchor-based ICCR at multiple units.

Reduce or remove:

- "Earlier work in this lineage..."
- "v4.0 contribution..."
- "overturns this reading..."
- "inherited Paper A v3.x..."
- Internal script-heavy provenance in the Introduction.

Detailed provenance belongs in Methodology, Results, Appendix, or reproducibility notes, not in the opening narrative.
## Suggested Rewrite Direction for Introduction Pivot Paragraph

Current issue location: in `paper/paper_a_v4_combined.md`, the Introduction paragraph beginning with "The methodological reframing relative to earlier versions..."

Recommended replacement direction:

```text
A key empirical finding is that the descriptor distributions do not support a within-population natural threshold. The apparent multimodality in the Big-4 accountant-level distribution is explained by between-firm location shifts and integer mass-point artefacts on the dHash axis. After firm-mean centring and integer-tie jitter, the pooled dHash dip-test rejection disappears. Within-firm diagnostics likewise do not reveal a stable bimodal antimode. We therefore treat mixture fits as descriptive summaries of firm-compositional structure rather than threshold-generating mechanisms, and calibrate the deployed operating rules using inter-CPA coincidence-rate anchors.
```

This preserves the methodological defense while removing the internal v3-to-v4 story.
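If a concrete illustration helps the Methodology rewrite, the centring-and-jitter preprocessing named in the replacement paragraph can be sketched in a few lines. This is a minimal numpy sketch under assumed names: `centre_and_jitter`, the jitter width, and the seed are placeholders for illustration, not the project's actual Script 39b-39e implementation.

```python
import numpy as np

def centre_and_jitter(dhash, firm_ids, jitter_width=0.5, seed=0):
    """Firm-mean centring followed by integer-tie jitter.

    dhash    : integer dHash distances (1-D sequence)
    firm_ids : firm label per observation (same length)
    Returns centred, jittered values suitable for a unimodality
    diagnostic such as a dip test.
    """
    dhash = np.asarray(dhash, dtype=float)
    firm_ids = np.asarray(firm_ids)
    centred = dhash.copy()
    # Remove between-firm location shifts so pooled multimodality
    # cannot be a pure firm-composition artefact.
    for firm in np.unique(firm_ids):
        mask = firm_ids == firm
        centred[mask] -= centred[mask].mean()
    # Break integer mass points with small uniform jitter so the
    # diagnostic is not driven by exact ties on the dHash axis.
    rng = np.random.default_rng(seed)
    return centred + rng.uniform(-jitter_width, jitter_width, size=centred.size)
```

On the returned values one would then run the unimodality diagnostic itself (not shown here), which is where the pooled dip-test rejection disappears in the replacement paragraph's account.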
## Abstract-Specific Comments

The Abstract is strong but very dense. It is currently optimized for technical reviewers rather than broad readability. That may be acceptable for IEEE Access, but the first sentence has a small grammar/style issue.

Suggested edit:

```text
Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes it feasible to reuse a stored signature image across reports -- through administrative stamping or firm-level electronic signing -- thereby undermining individualized attestation.
```

Reason:

- The current wording, "digitization makes reusing ... undermining ...", is grammatically awkward.
- The suggested version makes the causal relation explicit.

There is no need to remove the final limitation sentence. The sentence "not as a validated forensic detector; no calibrated error rates..." is important and should remain.
## Introduction-Specific Comments

### 1. Keep the legal framing but avoid legal overclaiming

The sentence saying non-hand-signed workflows "may fall within the literal statutory requirement" is acceptable because it is cautious. Do not strengthen it into a legal conclusion.

Preferred style:

- "may fall within"
- "raises substantive concerns"
- "may not represent meaningful individual attestation"

Avoid:

- "violates"
- "illegal"
- "non-compliant"
- "fraudulent"
### 2. Preserve the forgery distinction

The distinction between non-hand-signing detection and signature forgery detection is one of the strongest conceptual contributions. Keep it prominent.

Key idea to preserve:

- Forgery detection asks whether the signer is genuine.
- This paper asks whether the signing act was repeated for each document or a stored image was reused.
### 3. Reduce script/provenance detail in the Introduction

The current paragraph references scripts such as Script 39c and Script 39d. This makes the Introduction read like an internal review memo.

Recommendation:

- Remove or simplify script references in the Introduction.
- Keep exact script provenance in Methodology, Results, Appendix B, or supplementary material.

Specific risk:

- The current parenthetical "10 firms tested in Script 39c" is imprecise for jittered dHash. Script 39c's raw dHash tests reject unimodality; the non-Big-4 jittered-dHash no-rejection statement depends on a codex-verified read-only spike on the same substrate.

Safer Introduction wording:

```text
Within-firm diagnostics likewise fail to reveal stable bimodal structure after accounting for integer ties, including in eligible mid/small-firm checks.
```

If provenance must remain:

```text
Within-firm signature-level cosine checks fail to reject in eligible firms, and corresponding jittered-dHash checks fail to reject in Big-4 firms and in a read-only spike on the same mid/small-firm substrate.
```
### 4. Avoid presenting the Introduction as a Results section

The Introduction currently contains many detailed numbers. Some are necessary because the paper is methodological, but the v4 pivot paragraphs are numerically heavy.

Keep headline numbers:

- Dataset size: 90,282 reports, 182,328 signatures, 758 CPAs.
- Big-4 scope: 437 CPAs, 150,442 signatures.
- Key ICCR levels: per-comparison, per-signature, per-document.
- Firm heterogeneity: Firm A 0.62 vs Firms B/C/D 0.09-0.16.

Consider moving or reducing:

- Full script-specific details.
- Too many parenthetical rule semantics in the Introduction.
- Repeated mentions of inherited/v3/v4 framing.
## Recommended Minimum Patch List

1. Fix the Abstract's first-sentence grammar:

   ```text
   digitization makes it feasible to reuse...
   ```

2. Rewrite the Introduction paragraph that begins with "The methodological reframing relative to earlier versions..." so it describes the final methodological rationale rather than v3-to-v4 revision history.

3. Remove or narrow the `Script 39c` provenance in the Introduction, because the raw-vs-jittered dHash distinction is subtle and currently risky.

4. Replace internal-version language across the Introduction:

   - Replace "v4.0 adopts..." with "We adopt..."
   - Replace "Earlier work in this lineage..." with "A distributional-threshold approach would be inappropriate here because..."
   - Replace "inherited Paper A v3.x five-way box rule" with "the deployed five-way box rule" unless historical provenance is essential.

5. Preserve limitation language:

   - The paper should continue to say it is not a validated forensic detector.
   - The paper should continue to say calibrated error rates cannot be reported without signature-level ground truth.
## Reviewer Bottom Line

The paper should not hide that the distributional-threshold path failed; that is actually a methodological strength. But it should present this as a final empirical finding and design rationale, not as a visible research-history correction.

Recommended framing:

```text
Because the observed distribution does not provide a defensible natural threshold, we use ICCR calibration to characterize the deployed operating rules under explicit unsupervised assumptions.
```

This is cleaner, less complex, and more reviewer-facing than the current v3-to-v4 narrative.
## Additional Framing Issue: Are We Giving Thresholds or Not?

A likely reviewer confusion point is whether the paper provides a concrete classifier threshold or merely explains why no defensible threshold can be derived.

The intended answer should be explicit:

- The paper does provide a concrete, reproducible operational classifier.
- The paper does not claim that this classifier is ground-truth-optimal.
- The paper does not claim that the operating thresholds are natural antimodes in the descriptor distribution.
- The paper's calibration contribution is to characterize the deployed rule's inter-CPA coincidence behavior under unsupervised assumptions.

Recommended high-level framing:

```text
We use a fixed, pre-specified five-way operating rule. The present calibration does not derive an optimal threshold; instead, it quantifies the rule's inter-CPA coincidence behavior at per-comparison, per-signature, and per-document units under explicit unsupervised assumptions.
```

Plain-language interpretation:

```text
We have an explicit, reproducible five-way operating rule. The paper does not claim these thresholds are optimal or natural cut-points; rather, in the absence of signature-level ground truth, it uses ICCR to quantify the rule's specificity-proxy behavior.
```
## Concrete Threshold Language to Make Visible

The manuscript should not bury the actual operating thresholds. Somewhere early in Methodology, and preferably summarized in the Introduction, make the rule explicit:

```text
High-confidence non-hand-signed: cosine > 0.95 AND dHash <= 5.
Moderate-confidence non-hand-signed: cosine > 0.95 AND 5 < dHash <= 15.
Other outcomes follow the fixed five-way box rule.
```

If space allows, add a compact sentence:

```text
Thus, the system has explicit decision rules; what remains uncalibrated in the absence of signature-level labels is their true false-positive and false-negative error rate.
```

This directly answers the reviewer question: "Do the authors actually have a classifier?"
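For concreteness, the two published boxes above reduce to a few lines of code. This is a hedged sketch, not the deployed implementation: the label strings follow the HC/MC abbreviations used in this revision round, and the catch-all branch is a placeholder because the thresholds for the remaining three labels are not reproduced in this memo.

```python
def screening_label(cosine: float, dhash: int) -> str:
    """Sketch of the published screening boxes.

    Only the high- and moderate-confidence boxes are stated in this
    memo; the remaining labels of the deployed five-way rule
    (HSC / UN / LH) are collapsed into a placeholder here.
    """
    if cosine > 0.95 and dhash <= 5:
        return "HC"   # high-confidence non-hand-signed
    if cosine > 0.95 and 5 < dhash <= 15:
        return "MC"   # moderate-confidence non-hand-signed
    return "OTHER"    # HSC / UN / LH per the fixed box rule (not shown)
```

Making the rule visible in this form also supports the memo's point that the classifier is explicit even though its error rates are uncalibrated.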
## Rewrite Style Recommendation

Avoid language that sounds like the authors are unable to provide thresholds:

- Avoid: "No threshold can be derived."
- Avoid: "The distribution does not support classification."
- Avoid: "We cannot determine a threshold."

Use language that distinguishes operational thresholds from statistically natural or supervised-optimal thresholds:

- Prefer: "The deployed thresholds are operational rules rather than natural antimodes."
- Prefer: "We characterize these rules with ICCR rather than claiming supervised error rates."
- Prefer: "The absence of a distributional antimode motivates anchor-based calibration, not threshold-free analysis."
- Prefer: "The system is a concrete screening classifier with explicit unsupervised calibration limits."
## Reviewer-Facing Answer to the Threshold Question

If the manuscript needs one sentence that resolves the ambiguity, use:

```text
The system therefore uses explicit operating thresholds, but the evidentiary claim attached to those thresholds is limited: they define a reproducible screening rule whose coincidence behavior can be estimated under inter-CPA anchors, not a validated forensic decision boundary with calibrated error rates.
```

This should be the guiding style for the Abstract, the Introduction, and the start of Methodology.
## Readability Risk: Too Many Diagnostics Can Look Like Methodological Overbuilding

The manuscript's multi-method statistical design increases rigor, but it also creates a readability risk. In its current form, some sections may feel like a defensive accumulation of diagnostics rather than a clean research design.

Reviewer risk:

- The reader may ask: "Are the authors using many methods because the core classifier is unclear?"
- The reader may miss the simple main claim because the paper introduces too many caveats and validation tools early.
- The paper may look like "we used many methods, therefore credible" instead of "each method answers one necessary question."

Recommended main-thread sentence:

```text
We deploy a fixed five-way screening rule and characterize its unsupervised reliability limits using ICCR, after showing that the descriptor distribution does not support a natural threshold.
```

Plain-language interpretation:

```text
We have an explicit five-way screening rule. We first show that a natural distributional cut-point cannot serve as the threshold, and then use ICCR to describe the rule's reliability limits in the absence of labeled data.
```

All methods and diagnostics should serve this main thread.
## Core vs Supporting Diagnostics

Treat the following as core and keep them prominent:

- End-to-end pipeline: VLM -> YOLO -> ResNet -> cosine/dHash.
- Explicit five-way operating rule.
- Composition decomposition showing why the descriptor distribution does not yield a natural threshold.
- ICCR calibration at three units: per-comparison, per-signature, per-document.
- Firm heterogeneity and within-firm collision concentration.
- Ground-truth limitation and no true error-rate claim.

Treat the following as supporting diagnostics and avoid letting them dominate the main narrative:

- K=2 / K=3 mixture fits.
- Three-score Spearman convergence.
- Leave-one-firm-out reproducibility.
- BD/McCrary sensitivity.
- Ten-tool validation table.
- Pixel-identity positive anchor, especially because it is close to tautological for the high-confidence rule.

These supporting diagnostics can stay, but they should be framed as robustness checks, assumption checks, or supplementary evidence, not as independent central contributions.
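The two descriptor comparisons named in the core pipeline (cosine similarity over embeddings, Hamming distance over dHash values) are themselves simple; a minimal sketch follows. It assumes the descriptors have already been extracted and says nothing about the ResNet embedding or dHash construction steps, and the function names are illustrative, not the project's.

```python
import math

def dhash_distance(a: int, b: int) -> int:
    """Bit-level Hamming distance between two 64-bit dHash integers."""
    return bin(a ^ b).count("1")

def cosine_similarity(u, v) -> float:
    """Cosine similarity between two embedding vectors (e.g. from ResNet)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)
```

Keeping these two operations visible next to the five-way rule reinforces the "rule first" ordering recommended elsewhere in this memo.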
## Suggested Manuscript Structure for Clarity

Recommended structure for the Methodology / Results narrative:

1. Core Method

   Describe the pipeline, descriptor construction, and the five-way rule.

2. Why the Threshold Is Operational Rather Than Natural

   Use the composition decomposition only. Avoid over-explaining K=3, BD/McCrary, or historical mixture logic here.

3. How the Rule Is Calibrated Without Ground Truth

   Explain ICCR and the three reporting units: per-comparison, per-signature, per-document.

4. What the Calibration Reveals

   Report firm heterogeneity and within-firm collision concentration.

5. Supporting Diagnostics

   Place K=3, Spearman convergence, LOOO, BD/McCrary, and pixel-identity checks here as supporting evidence.
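Under the any-pair semantics mentioned in the revision summary, the three reporting units in step 3 amount to an aggregation over flagged inter-CPA comparisons, and the Wilson interval added to the ICCR tables is a standard score interval. The sketch below is an assumption-laden illustration: the tuple layout, function names, and any-pair aggregation are our reading of the memo, not the paper's actual scripts.

```python
import math
from collections import defaultdict

def wilson_ci(k, n, z=1.96):
    """Wilson 95% score interval for a binomial proportion k/n."""
    if n == 0:
        return (0.0, 0.0)
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (centre - half, centre + half)

def iccr(comparisons):
    """ICCR at three units from flagged inter-CPA comparisons.

    comparisons: iterable of (sig_a, sig_b, doc_a, doc_b, flagged)
    for cross-CPA pairs. A signature or document counts as hit if
    any of its pairs is flagged (any-pair semantics).
    """
    n_pairs = n_flagged = 0
    sig_hit, doc_hit = defaultdict(bool), defaultdict(bool)
    for sa, sb, da, db, flagged in comparisons:
        n_pairs += 1
        n_flagged += bool(flagged)
        for s in (sa, sb):
            sig_hit[s] |= bool(flagged)
        for d in (da, db):
            doc_hit[d] |= bool(flagged)
    return {
        "per_comparison": n_flagged / n_pairs,
        "per_signature": sum(sig_hit.values()) / len(sig_hit),
        "per_document": sum(doc_hit.values()) / len(doc_hit),
    }
```

Reporting all three units side by side, each with a Wilson interval, is what lets the manuscript present ICCR as a specificity proxy rather than a validated error rate.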
## Rewrite Style for Multi-Method Sections

Avoid:

```text
We apply a multi-tool validation framework consisting of ten diagnostics...
```

This can sound like methodological stacking.

Prefer:

```text
Each supporting diagnostic addresses a specific failure mode: composition artefacts, inter-CPA coincidence, pool-size effects, firm heterogeneity, or positive-anchor capture.
```

Avoid:

```text
The conjunction of ten tools constitutes validation...
```

Prefer:

```text
Together, these diagnostics define the limits of what can be supported without signature-level ground truth.
```

Avoid presenting auxiliary diagnostics before the reader understands the classifier.

Preferred order:

```text
Rule first. Then why not natural threshold. Then ICCR calibration. Then robustness.
```
## Reviewer-Facing Principle

The paper should not read as:

```text
We used many methods, so the result is credible.
```

It should read as:

```text
We use one explicit screening rule. Each statistical diagnostic answers one necessary question about how that rule should be interpreted under unsupervised constraints.
```

This distinction is important for readability and reviewer trust.