Review Handoff: Methodology, Results, Discussion, Conclusion
Date: 2026-05-15
Target manuscript: paper/paper_a_v4_combined.md
Scope reviewed: §III Methodology, §IV Experiments and Results, §V Discussion, §VI Conclusion
Companion review: paper/review_handoff_abstract_intro_20260515.md (Abstract + Introduction)
This handoff continues the same framing principle established for Abstract + Introduction:
"One explicit screening rule. Each statistical diagnostic answers one necessary question about how that rule should be interpreted under unsupervised constraints."
If only the Abstract and Introduction are revised, the manuscript will exhibit tonal mismatch when the reader drops into the body sections, which currently retain internal-version language and a defensive-accumulation framing for the supporting diagnostics. The body must be brought into the same register.
Overall Assessment
The body sections are substantively defensible. The core empirical results — composition decomposition, anchor-based ICCR at three units, firm heterogeneity logistic, cross-firm hit matrix, alert-rate sensitivity — are presented in adequate quantitative detail with explicit unsupervised-validation caveats. The Discussion correctly distinguishes positive and negative anchors. The Conclusion lists eight methodological contributions that map onto the v4 contribution set.
The recurring weakness across §III / §IV / §V / §VI is not empirical. It is two intertwined narrative tendencies:
- The body is still written as a revision history relative to v3.x in many paragraphs — "v4.0 strengthens", "v4.0 retroactively reframes", "v4.0 adopts", "inherited from v3.x", "the v3.x role of Firm A". This is internally honest but, in a submitted paper, signals to the reviewer that the authors are arguing with themselves.
- The supporting diagnostics are repeatedly presented as a collection ("multi-tool framework", "ten-tool unsupervised-validation collection", "Table XXVII"). This collection framing is precisely the readability risk identified in the Abstract / Introduction handoff under "Readability Risk: Too Many Diagnostics Can Look Like Methodological Overbuilding." It currently appears unmodified in §III-M.
Recommended disposition: Minor Revision for narrative voice and structural emphasis, not for empirical weakness.
Main Reviewer Concerns
1. The v3-to-v4 revision narrative is pervasive in the body and must be removed
The Abstract / Introduction handoff identified "v4.0 adopts", "Earlier work in this lineage", and "inherited Paper A v3.x five-way box rule" as patterns to strip. The same patterns occur throughout the body sections. Representative instances (not exhaustive):
- §III-G: "We earlier (v4.0 first draft) listed 'statistical multimodality at the accountant level' among the scope justifications..."
- §III-H opening: "v4.0 distinguishes two reference populations in its calibration, replacing v3.x's single-anchor framing."
- §III-I.5 closing sentence: "§III-L develops the v4.0 anchor-based threshold calibration framework..."
- §III-L.0 "Why retained without v4.0 recalibration" subsection title.
- §III-L.7 closing: "The operational classifier of §III-L.0 is the inherited v3.x five-way box rule..."
- §IV opening paragraph: "The v4.0 primary analyses (§IV-D through §IV-J) are scoped to..." and "§IV-A through §IV-C report inherited corpus-wide v3.x material; §IV-L (feature backbone ablation) is also inherited. §IV-M consolidates the v4-new anchor-based ICCR calibration tables."
- §IV-I: "v4.0 retroactively reframes the metric as inter-CPA pair-level coincidence rate (ICCR) rather than 'False Acceptance Rate'..."
- §IV-J: "v4.0 does not change this aggregation rule; only the population over which it is computed changes (Big-4 subset)."
- §IV-M opening: "v4-new empirical results that support..."
- §V-B: "A central empirical finding of v3.x was that per-signature similarity does not admit a clean two-mechanism mixture... v4.0 strengthens and extends this signature-level reading."
- §V-C: "In v4.0 we treat Firm A as a templated-end case study rather than as the calibration anchor for the operational threshold."
- §V-H opening: "The first nine are v4.0-specific; the last five are inherited from v3.20.0 §V-G and still apply to the v4.0 pipeline."
The remediation principle is the same as for the Introduction pivot paragraph. The final manuscript should describe the final methodological state and its rationale, not the trajectory by which that state was reached. Internal provenance — "this analysis is reproduced from v3.x §IV-F.1 / Script 28" — belongs in an Appendix B reproducibility table or supplementary material, not in the main narrative arc.
A safe rewriting heuristic: every sentence that begins with "v4.0", "v3.x", "v4-new", "inherited", or "earlier work" is a candidate for either deletion or rewriting in the present tense without version labels.
2. The "Ten-Tool Unsupervised-Validation Collection" frame must be retired
§III-M Table XXVII is the canonical instance of the readability risk that the Abstract / Introduction handoff flagged. The current frame is:
"v4.0 adopts a multi-tool collection of partial-evidence diagnostics (Table XXVII), each with an explicitly disclosed assumption..." "No single tool in this collection provides ground-truth validation. Their conjunction constitutes the unsupervised validation ceiling that the v4.0 corpus admits."
This is exactly the language the Abstract / Introduction handoff identified as risky ("We used many methods, so the result is credible"). It reappears verbatim in the §VI Conclusion as "a multi-tool framework for characterising and disclosing its operational behaviour at the Big-4 sub-corpus scope" and "(8) a ten-tool unsupervised-validation collection (§III-M Table XXVII) that explicitly discloses each tool's untested assumption."
The recommended reframe is:
The corpus does not admit standard supervised classifier validation: no signature-level
ground truth exists for hand-signed versus replicated classes, so False Rejection Rate,
sensitivity, recall, EER, ROC-AUC, precision, and positive predictive value are not
reportable. Each diagnostic in this section therefore addresses one specific
failure mode of an unsupervised screening classifier: composition artefacts,
inter-CPA coincidence, pool-size confounding, firm heterogeneity, threshold
sensitivity, or positive-anchor capture. Together they characterise the limits of
what can be claimed without signature-level ground truth.
Keep Table XXVII as a reference table if useful, but retitle it as "Diagnostic — failure mode addressed — disclosed assumption" rather than "Ten-tool collection". The word "ten" should not appear in the manuscript.
3. The §V-H Limitations list is correct but defensively ordered
§V-H lists fourteen limitations. The first one — "No signature-level ground truth; no true error rates reportable" — is the load-bearing limitation that everything else in v4.0 hinges on. The next two — "Inter-CPA negative-anchor assumption is partially violated" and "Scope" — are also major. The other eleven are real but secondary. The current presentation gives every item roughly equal visual weight as a flat list.
Recommended reorganisation:
- Primary limitations (3 items): (a) no signature-level ground truth, (b) inter-CPA negative-anchor assumption partially violated and firm-dependent, (c) Big-4 scope (full-dataset robustness is light).
- Secondary limitations (4 items): pixel-identity conservative subset; inherited rule components not separately v4-validated; deployed-rate excess not a true-positive rate; A1 pair-detectability stipulation.
- Documented features rather than limitations (2 items): K=3 hard-posterior composition sensitivity; no partner-level mechanism attribution.
- Inherited engineering limitations (5 items): ImageNet features, red-stamp HSV preprocessing, longitudinal scan / PDF / compression, source-exemplar misattribution, legal interpretation.
This preserves the disclosures but signals to the reviewer which limitations carry the methodological weight and which are routine engineering caveats.
4. §III-F SSIM and pixel-comparison justification is too long for Methodology
§III-F currently dedicates roughly 15 lines (lines 112–127 in paper_a_methodology_v3.md) to justifying why SSIM and pixel-level comparison are not used as primary descriptors. The argument is correct (design-level mismatch between SSIM's natural-image quality factors and signature-crop artefacts; sub-pixel alignment fragility of pixel L1/L2), but in its current form it reads as a defensive response to an anticipated reviewer objection rather than as forward Methodology exposition.
Recommended reduction: collapse the argument to one short paragraph (3–4 sentences) and move the full design-level discussion to Appendix B. The Methodology body should state the choice (cosine on deep features + dHash) and briefly justify it (both stable across print-scan cycles by design), with the SSIM / pixel-comparison rebuttal in an appendix or a single citation footnote.
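The recommended one-paragraph justification ("cosine on deep features + dHash, both stable across print-scan cycles by design") is easier to audit with the comparison stage made concrete. The sketch below is illustrative only: nearest-neighbour downsampling and an 8×9 dHash grid are assumptions, and the pipeline's actual feature extractor and hashing parameters are not specified here.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two deep-feature vectors (plain lists of floats).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def dhash_bits(gray, hash_size=8):
    # Difference hash of a 2-D grayscale crop: downsample to
    # (hash_size, hash_size + 1) by nearest-neighbour sampling, then
    # record whether each pixel is brighter than its left neighbour.
    h, w = len(gray), len(gray[0])
    rows = [i * h // hash_size for i in range(hash_size)]
    cols = [j * w // (hash_size + 1) for j in range(hash_size + 1)]
    small = [[gray[r][c] for c in cols] for r in rows]
    return [int(row[j + 1] > row[j]) for row in small for j in range(hash_size)]

def dhash_distance(bits_a, bits_b):
    # Hamming distance between two dHash bit vectors (0 = identical hashes).
    return sum(x != y for x, y in zip(bits_a, bits_b))
```

Both descriptors tolerate print-scan artefacts by construction: cosine ignores global feature-magnitude shifts, and dHash depends only on local intensity orderings, not absolute pixel values.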
5. §IV's section opener still encodes provenance not appropriate to a Results section opener
The current §IV opener:
"The v4.0 primary analyses (§IV-D through §IV-J) are scoped to the Big-4 sub-corpus (Firms A–D, n = 437 CPAs with n_sig ≥ 10, totalling 150,442 signatures with both descriptors available) per the methodology choice articulated in §III-G. The §IV-K Full-Dataset Robustness section reports the full-dataset (686 CPAs) variant of the K=3 mixture + Paper A box-rule Spearman analysis as a cross-scope robustness check. §IV-A through §IV-C report inherited corpus-wide v3.x material; §IV-L (feature backbone ablation) is also inherited. §IV-M consolidates the v4-new anchor-based ICCR calibration tables."
Recommended replacement direction:
Section IV reports the empirical results that calibrate and characterise the
operational classifier of §III-L. The primary analyses (§IV-D through §IV-J,
§IV-M) are scoped to the Big-4 sub-corpus (Firms A–D, 437 CPAs, 150,442
signatures); §IV-K reports a full-dataset (686 CPAs) robustness check on the K=3
mixture and per-CPA score-rank convergence; §IV-A through §IV-C and §IV-L
report the corpus-wide pipeline performance and feature-backbone ablation that
support the descriptor choice of §III-F.
This preserves the scope information while removing the v3-to-v4 inheritance labels and the "v4-new" prefix on §IV-M.
Section-by-Section Comments
§III-A Pipeline Overview
The pipeline diagram caption (lines 12–20) describes the classifier as "Firm A P7.5-anchored", which is residual v3 language that conflicts with the v4 reframe. v4 explicitly abandons Firm A as the calibration anchor in favour of inter-CPA ICCR (§III-H, §III-L). The figure caption should be updated to read "Anchor-Calibrated Five-Way Classifier" or similar, consistent with the §III-L title "Anchor-Based Threshold Calibration and Operational Classifier".
The §III-A second paragraph ("Throughout this paper we use the term non-hand-signed rather than 'digitally replicated'...") is well-positioned and should be kept.
§III-B Data Collection
No issues identified.
§III-C Signature Page Identification
No issues identified. The 98.8% VLM-YOLO agreement footnote is appropriately scoped ("we do not attempt to attribute the residual").
§III-D Signature Detection
No issues identified.
§III-E Feature Extraction
No issues identified.
§III-F Dual-Method Similarity Descriptors
As noted in Main Concern 4: shorten the SSIM and pixel-comparison rebuttal to ~3–4 sentences and move full design-level argument to Appendix B.
§III-G Unit of Analysis and Scope
This section is currently long and contains the "We earlier (v4.0 first draft) listed..." paragraph that explicitly walks through the methodological revision. That paragraph (currently at the end of §III-G, before the sample-size reconciliation) should be deleted. The four-item scope rationale list above it is good and should be kept.
The sample-size reconciliation paragraph (n=150,442 vs n=150,453) is technically necessary but is repeated almost verbatim in §IV-J as a parenthetical. Consider centralising it in §III-G with a forward reference, or in an Appendix B reproducibility note.
§III-H Reference Populations
Replace the opening sentence:
"v4.0 distinguishes two reference populations in its calibration, replacing v3.x's single-anchor framing."
with:
The calibration distinguishes two reference populations: Firm A as a within-Big-4
templated-end case study, and the 249 non-Big-4 CPAs as an out-of-target reference
for internal-consistency checking.
The remainder of §III-H is well-written; the descriptive content is fine. The "v3.x's single-anchor framing" phrase is the only internal-version language that needs removal.
§III-I Distributional Diagnostics
This is the strongest single section in the body. The four sub-diagnostics (dip test, mixture, BD/McCrary, composition decomposition) are tightly organised around one claim: the descriptor distribution does not provide a within-population bimodal antimode. The 2x2 factorial table at §III-I.4 is the empirical centrepiece of the v4 reframe.
One small narrative issue: §III-I.5 ("Conclusion") closes with "§III-L develops the v4.0 anchor-based threshold calibration framework, which derives operational rates from inter-CPA pair-level negative-anchor coincidences rather than from a distributional antimode." Remove "v4.0" — write "§III-L develops the anchor-based threshold calibration framework..."
§III-J K=3 as a Descriptive Partition of Firm-Composition Contrast
The section header is clear and the framing ("Both fits are descriptive partitions... not within-population mechanism modes") is correct.
The current closing paragraph references "§III-K" for cross-checks between the box rule and K=3, but §III-K is the next subsection — this is a within-Methodology forward reference and reads slightly oddly. Consider rephrasing as "Cross-checks between the inherited five-way box rule and the K=3 partition appear in §III-K below."
§III-K Convergent Internal-Consistency Checks
This section is well-handled. The opening caveat — "the three scores are not statistically independent measurements... so their high pairwise rank correlations are partly a mechanical consequence of shared inputs" — is exactly the methodological honesty the v4 reframe needs.
One narrative issue: §III-K.4 (positive-anchor miss rate) and §III-K.3 (LOOO reproducibility) are summarised in §III-K but also reported in detail in §III-J and §IV-G respectively. Consider whether the §III-K subsections add narrative value beyond cross-referencing — if not, §III-K could shrink to just the three-score Spearman block (§III-K.1) and a one-line cross-reference to LOOO and pixel-identity, with the detail living in §III-J and §IV-G / §IV-H.
§III-L Anchor-Based Threshold Calibration and Operational Classifier
This section has the operating-rule text that the Abstract / Introduction handoff explicitly asked for ("Cosine > 0.95 AND dHash ≤ 5" etc., §III-L.0 item 1). Good.
The "Terminological note on FAR" at the end of §III-L.0 is explicit and reviewer-facing. Keep it.
Issues:
- "Why retained without v4.0 recalibration" — replace subsection title and contents to remove v4 references. The argument ("the inherited thresholds preserve continuity with prior reporting; §III-I.4 establishes that recalibration cannot be anchored on distributional antimodes; §III-L.1 confirms the cosine threshold's specificity at the inter-CPA pair level is reproducible") is intact without the v4 label.
- §III-L.7 ("K=3 not used as classifier") restates content already in §III-J. Consider deleting §III-L.7 and adding a one-line note inside §III-L.0 ("The K=3 mixture of §III-J is used as an accountant-level descriptive summary alongside the per-signature five-way classifier; K=3 hard-posterior membership is not used to assign signature-level or document-level labels in any result table").
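For concreteness, the per-signature rule quoted in §III-L.0 item 1 is a threshold cascade over the two descriptors. Only the HC band's thresholds ("Cosine > 0.95 AND dHash ≤ 5") are quoted in the manuscript; the MC, HSC, and LH boundaries below are placeholders for illustration, not the deployed values.

```python
def five_way_label(cosine: float, dhash_dist: int) -> str:
    # Five-way screening label from the two similarity descriptors.
    # HC thresholds are as quoted in the manuscript; all other band
    # boundaries here are hypothetical placeholders.
    if cosine > 0.95 and dhash_dist <= 5:
        return "HC"    # high-confidence screening label
    if cosine > 0.90 and dhash_dist <= 10:   # placeholder MC band
        return "MC"    # moderate confidence
    if cosine > 0.85:                        # placeholder HSC band
        return "HSC"
    if cosine < 0.50:                        # placeholder hand-signed band
        return "LH"    # likely hand-signed
    return "UN"        # unclassified
```

The cascade form makes the joint-threshold semantics explicit: a signature reaches a lower band only after failing every stricter band above it.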
§III-M Validation Strategy and Limitations under Unsupervised Setting
Replace the framing as described in Main Concern 2. Keep the underlying disclosure content. Consider whether Table XXVII is best presented as a numbered methodological table or as an Appendix B reproducibility-and-assumption summary; in either case retitle and reframe so that "ten" does not appear and the unifying principle is "each diagnostic addresses one specific unsupervised failure mode."
The "What v4.0 does not claim" and "What v4.0 does claim" subsections at the end of §III-M are strong but the framing tag "v4.0 does not claim" / "v4.0 does claim" is the problematic version-language pattern. Replace with "Limits of the present analysis" and "Scope of the present analysis."
§III-N Data Source and Firm Anonymization
No issues. The residual-identifiability disclosure is appropriately framed.
§IV-A Experimental Setup
No issues identified.
§IV-B Signature Detection Performance
No issues identified.
§IV-C All-Pairs Intra-vs-Inter Class Distribution Analysis
The pairwise-non-independence caveat ("we therefore rely primarily on Cohen's d... A Cohen's d of 0.669 indicates a medium effect size, confirming that the distributional difference is practically meaningful, not merely an artifact of the large sample count") is well-positioned. Keep.
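Since the section leans on Cohen's d rather than p-values, the statistic's pooled-standard-deviation form is worth having in view; this is the classic two-sample estimator, which may differ in detail from the manuscript's exact computation.

```python
import math

def cohens_d(xs, ys):
    # Cohen's d with pooled standard deviation (classic two-sample form):
    # standardised mean difference, robust to large sample counts inflating
    # the significance of tiny effects.
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)   # unbiased variances
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    pooled = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled
```

Unlike a p-value, d does not grow with the pair count, which is exactly why it is the right summary under pairwise non-independence.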
§IV-D Big-4 Accountant-Level Distributional Characterisation
The Table V dip-test row labels are clear. The "v4-new composition-decomposition diagnostics that establish this finding are tabulated in §IV-M below alongside the anchor-based ICCR calibration" should drop the "v4-new" — just write "...are tabulated in §IV-M below alongside the anchor-based ICCR calibration."
§IV-E Big-4 K=2 / K=3 Mixture Fits
The "descriptive partition; not mechanism clusters per §III-J" labels in Tables VII and VIII are consistent with the v4 reframe. Keep. Drop "(v3.x role)" anywhere it appears.
§IV-F Convergent Internal-Consistency Checks
This is duplicate Results-side reporting of §III-K. Consider whether the duplication adds value or is redundant. If both sections must remain, then §III-K should describe the method (three scores, why they are not independent) and §IV-F should report the numbers; currently §III-K reports both the method and the numbers, leaving §IV-F as a near-duplicate. Recommendation: trim §IV-F to just the per-firm summary table and the Cohen-kappa block, with the method description living in §III-K.
§IV-G Leave-One-Firm-Out Reproducibility
Tables XII and XIII are well-organised. The interpretation paragraph following Table XIII correctly identifies the K=2 vs K=3 contrast (K=2 unstable; K=3 component shape reproducible but hard-posterior membership composition-sensitive). Keep.
§IV-H Pixel-Identity Positive-Anchor Miss Rate
The "close to tautological" caveat is appropriately positioned. Keep. The reverse-anchor cut by prevalence calibration disclosure is also appropriate.
§IV-I Inter-CPA Pair-Level Coincidence Rate
Replace:
"v4.0 retroactively reframes the metric as inter-CPA pair-level coincidence rate (ICCR) rather than 'False Acceptance Rate' because..."
with:
The metric reported here is the inter-CPA pair-level coincidence rate (ICCR). It
is the per-pair rate at which two signatures from different CPAs satisfy the
deployed rule. We do not label it as a False Acceptance Rate because (a) FAR has
a biometric-verification meaning that requires ground-truth negative labels, and
(b) the inter-CPA negative-anchor assumption is partially violated by within-firm
cross-CPA template-like collision structures (§III-L.4 cross-firm hit matrix).
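The per-pair definition above can be made concrete. In this sketch the descriptor representation and the `rule_hit` predicate are placeholders standing in for the deployed cosine/dHash joint rule; only the pair-enumeration and exclusion logic is the point.

```python
from itertools import combinations

def iccr(signatures, rule_hit):
    # Inter-CPA pair-level coincidence rate: among all pairs of signatures
    # drawn from *different* CPAs, the fraction satisfying the deployed rule.
    # `signatures` is a list of (cpa_id, descriptor); `rule_hit` is a pairwise
    # predicate on two descriptors.
    hits = total = 0
    for (cpa_a, d_a), (cpa_b, d_b) in combinations(signatures, 2):
        if cpa_a == cpa_b:
            continue  # intra-CPA pairs are excluded by construction
        total += 1
        hits += bool(rule_hit(d_a, d_b))
    return hits / total if total else float("nan")
```

Note that nothing in this computation requires ground-truth labels, which is why ICCR is reportable where FAR is not; the caveat is that the denominator treats every cross-CPA pair as a presumed negative.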
§IV-J Five-Way Per-Signature + Document-Level Classification Output
The sample-size reconciliation parenthetical ("11 of 150,453 loaded Big-4 signatures lacked one or both descriptors and were excluded") is repeated from §III-G. Centralise once and forward-reference.
"v4.0 does not change this aggregation rule; only the population over which it is computed changes" should be "The aggregation rule is the inherited worst-case rule (HC > MC > HSC > UN > LH); we apply it to the Big-4 sub-corpus."
The MC band capture-rate inheritance disclosure is appropriately framed but should drop the "v4.0 does not re-derive" phrasing; rewrite as "The moderate-confidence band's calibration and capture-rate evidence is reported in [Appendix B / v3.20.0 Tables IX, XI, XII, XII-B] and is not regenerated on the Big-4 subset."
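The worst-case aggregation rule paraphrased above (HC > MC > HSC > UN > LH) amounts to taking the most replication-indicative label present among a document's signatures; a minimal sketch:

```python
# Severity ordering for the worst-case document-level aggregation:
# higher value = more replication-indicative screening label.
SEVERITY = {"HC": 4, "MC": 3, "HSC": 2, "UN": 1, "LH": 0}

def document_label(signature_labels):
    # A document inherits the most severe label among its signatures.
    return max(signature_labels, key=SEVERITY.__getitem__)
```

The rule is deliberately conservative for screening: a single HC signature flags the whole document regardless of how many signatures classify as LH.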
§IV-K Full-Dataset Robustness
The scope-of-§IV-K paragraph ("The scope of §IV-K is deliberately narrow: we re-run only the K=3 mixture + Paper A operational-rule per-CPA less-replication-dominated rate analysis...") is defensively framed but the substance is correct. Consider shortening the "what we do not do" enumeration and emphasising the "what we do show" finding (K=3 + Paper A box-rule Spearman convergence preserved at full scope; ρ drift = 0.007).
§IV-L Feature Backbone Comparison
This is inherited v3.x content. The "inherited unchanged from the v3.20.0 backbone-ablation table" framing is acceptable here because it is a methodological choice (do not re-run the ablation at the Big-4 scope) rather than a narrative pivot. Keep.
§IV-M v4-New Anchor-Based ICCR Calibration Results
Drop the "v4-new" from the section heading. Recommended replacement heading: "Anchor-Based ICCR Calibration Results".
The section is empirically dense and methodologically sound. Tables XXI–XXVI cover the four units (per-comparison, per-signature, per-document, firm logistic + hit matrix) and the alert-rate sensitivity. Keep all tables. Drop "v4 new" / "v4-new" wherever it appears as a row qualifier or section subheading.
§V-A Non-Hand-Signing Detection as a Distinct Problem
Keep. This section preserves the forgery distinction (Main concern #2 in the Abstract / Introduction handoff).
§V-B Per-Signature Similarity is a Continuous Quality Spectrum
Replace the v3-to-v4 opening:
"A central empirical finding of v3.x was that per-signature similarity does not admit a clean two-mechanism mixture: dip-test fails to reject unimodality at the signature level for Firm A, BIC prefers a 3-component fit, and BD/McCrary candidate transitions lie inside the high-similarity mode rather than between modes. v4.0 strengthens and extends this signature-level reading."
with:
The Big-4 accountant-level descriptor distribution rejects unimodality on both
marginals at p < 5 × 10⁻⁴ (§IV-D Table V). The composition decomposition of
§III-I.4 (Scripts 39b–39e) shows this rejection is fully attributable to two
non-mechanistic sources...
This preserves the §V-B content while removing the v3.x lineage statement.
§V-C Firm A as the Templated End of Big-4
Replace "In v4.0 we treat Firm A as a templated-end case study rather than as the calibration anchor for the operational threshold" with "We treat Firm A as a templated-end case study within the Big-4 sub-corpus rather than as the calibration anchor for the operational threshold."
Drop the "the v3.x role of Firm A" historical sub-clause that appears in §III-G item 2.
The Firm A byte-level pixel-identity reference (145 signatures across ~50 distinct partners; 35 byte-identical matches across fiscal years) is inherited from v3.x §IV-F.1 / Script 28 — this byte-level granularity is the strongest single piece of v3.x evidence that should survive into v4 because it directly supports the §V-C templated-end characterisation. Keep the reference but recast as "Byte-level decomposition of these 145 signatures (Appendix B) shows..." rather than the current "The additional v3.x finding... is inherited from v3.20.0 §IV-F.1 / Script 28..."
§V-D K=2 / K=3 as Descriptive Firm-Compositional Partitions
Keep. The contrast between K=2 instability and K=3 reproducible-component-shape-but-composition-sensitive-membership is one of the cleanest narrative arcs in the paper.
§V-E Three-Score Convergent Internal-Consistency
Keep. The "not statistically independent" caveat is correctly positioned. The within-Big-4 non-Firm-A disagreement between Score 2 and Scores 1/3 is correctly disclosed.
§V-F Anchor-Based Multi-Level Calibration
Keep. This is the v4 contribution. Drop any residual "v4" labels.
§V-G Pixel-Identity as a Hard Positive Anchor; Inherited Inter-CPA Negative Anchor
Keep. The "positive necessary but not sufficient" caveat and the "specificity proxy under a partially-violated assumption" framing are exactly right.
Drop "Inherited" from the §V-G section heading — the heading currently reads "Pixel-Identity as a Hard Positive Anchor; Inherited Inter-CPA Negative Anchor Reframed as Coincidence Rate", which encodes the v3-to-v4 history in the section title itself. Recommended: "Pixel-Identity Positive Anchor and Inter-CPA Coincidence-Rate Negative Anchor".
§V-H Limitations
Reorganise as described in Main Concern 3: primary (3) / secondary (4) / documented features (2) / inherited engineering (5).
Drop "inherited from v3.20.0 §V-G" qualifiers — the limitation either applies to the pipeline or it does not; the version source is reproducibility metadata that belongs in Appendix B.
§VI Conclusion
Replace the opening framing:
"We present a fully automated pipeline for detecting non-hand-signed CPA signatures in Taiwan-listed financial audit reports and a multi-tool framework for characterising and disclosing its operational behaviour at the Big-4 sub-corpus scope."
with:
We present a fully automated pipeline for detecting non-hand-signed CPA
signatures in Taiwan-listed financial audit reports, together with an
anchor-calibrated screening framework that characterises the pipeline's
operational behaviour at the Big-4 sub-corpus scope under explicit unsupervised
assumptions.
The eight numbered contributions are content-correct but presented in flat-list form. Consider grouping into four thematic clusters:
- Why the descriptor distribution does not anchor a natural threshold (contributions 1, 5).
- How the deployed rule is calibrated under unsupervised constraints (contributions 2, 6, 7).
- What the calibration reveals about firm heterogeneity (contributions 3, 4).
- Methodological positioning (contribution 8 — but reframe per Main Concern 2).
The Future Work block (four items) is fine; consider trimming the second item ("a separate study to distinguish deliberate template sharing from passive firm-level production artefacts") which is the only item that involves additional fieldwork rather than methodological extension.
Recommended Minimum Patch List
- Strip v3-to-v4 revision language throughout §III, §IV, §V, §VI. Mechanical pass on "v4.0", "v3.x", "v4-new", "inherited", "earlier work in this lineage". Replace with present-tense descriptions of the final methodological choice and forward references to Appendix B for reproducibility provenance.
- Retire the "ten-tool unsupervised-validation collection" framing in §III-M and the "multi-tool framework" phrase in §VI Conclusion. Replace with "each diagnostic addresses one specific unsupervised failure mode" framing. Retitle Table XXVII so that "ten" does not appear.
- Reorganise §V-H Limitations into primary / secondary / documented-features / inherited-engineering groupings.
- Shorten §III-F SSIM and pixel-comparison rebuttal to ~3–4 sentences; move design-level discussion to Appendix B.
- Update Figure 1 caption (currently in §III-A commented HTML) to remove "Firm A P7.5-anchored" residual v3 language.
- Rewrite the §IV opener paragraph to remove the inherited-vs-v4-new section labels.
- Rewrite the §IV-I opening paragraph to remove "v4.0 retroactively reframes the metric...".
- Drop "v4-new" from the §IV-M section heading; replace with "Anchor-Based ICCR Calibration Results".
- Centralise the n=150,442 vs n=150,453 sample-size reconciliation in §III-G; remove the duplicate parenthetical from §IV-J.
- Consider trimming §IV-F to numbers-only (per-firm summary table + Cohen kappa), with the method description living in §III-K.
- Consider deleting §III-L.7 (duplicate of §III-J K=3-not-used-as-classifier claim) and adding a one-line note in §III-L.0.
Reviewer Bottom Line
The body sections of v4 are empirically defensible and methodologically internally consistent. The required revisions are stylistic and structural rather than substantive:
- Remove the v3-to-v4 revision narrative from the present-tense exposition.
- Reframe the supporting diagnostics from "ten-tool collection" to "each diagnostic addresses one unsupervised failure mode."
- Reorganise the limitations list so that the load-bearing limitations are visibly more prominent than the routine engineering caveats.
- Move provenance and reproducibility detail to Appendix B / supplementary material.
These changes preserve every quantitative claim and every disclosure currently in the manuscript. They tighten the narrative voice so that the reader experiences the v4 methodological choices as the final state of the design rather than as an ongoing argument with an earlier version. Combined with the Abstract / Introduction patches in the companion handoff, the manuscript should read as a single coherent submission rather than as a layered revision document.
Additional Cross-Cutting Observation: Script Provenance in Tables
Across §III, §IV, §V, and the Conclusion, tables are annotated with (Source: Script 32 / 34 / 35 / 38 / 40b / 43 / 44 / 45 / 46) parentheticals. This is appropriate for reproducibility but heavy at the visual level — every table footer in §IV-D through §IV-M carries one of these annotations.
Recommended consolidation: move the script-to-table mapping to a single Appendix B reproducibility table ("Table B-1. Script-to-table provenance map"), and replace the inline annotations with a single one-line note at the start of §IV ("Script-to-table provenance is summarised in Appendix B Table B-1; raw outputs are available in the supplementary repository").
This is a minor change but it materially reduces the visual signal that the paper is built on a large number of separate scripts.
Closing Note
This review covers the body sections only. The Abstract / Introduction handoff (paper/review_handoff_abstract_intro_20260515.md) covers the front matter. The two handoffs should be applied together; applying only one of them will produce tonal mismatch as the reader moves from the front matter into the body.
The References and the Appendix have not been reviewed and may benefit from a separate handoff if the Appendix is to absorb the SSIM / pixel-comparison material and the reproducibility-provenance table recommended above.