Phase 6 round-7 codex 3-axis review fixes: 11 MAJOR + 5 MINOR

Codex GPT-5.5 3-axis peer review (paper/codex_review_gpt55_v4_round_3axis.md)
identified 11 MAJOR + 5 MINOR + 0 BLOCKER on three axes: (1) abstract/body
tone consistency, (2) methodology clarity / v3 residue, (3) no implicit
within-CPA or cross-year signature-consistency assumptions. 13 patches
applied across 4 source files; mirrored in paper_a_v4_combined.md.

Axis 1 (tone consistency between abstract and body):
- S I L33: "resolves the ambiguity" -> "provides complementary evidence
  for screening cases where ... hypotheses diverge"
- S I L35: "disproves the distributional-threshold path" -> "does not
  support the distributional-threshold path"
- S I L37 / S V-F L29: "characterise the deployed five-way classifier
  at three units" -> "characterise the deployed HC sub-rule and
  document-level HC+MC alarm derived from the five-way classifier at
  three units" (consistent with S V-H which says only HC sub-rule and
  HC+MC alarm are re-characterised by the present ICCR battery)
- S I L39 / S V-C / S III-L.4: "consistent with firm-specific template,
  stamp, or document-production reuse mechanisms" -> "consistent with --
  but does not independently establish -- firm-level template-like
  reuse, digitisation-pipeline homogeneity, or signing-style
  homogeneity, which descriptor-only data cannot separate (S V-H)"
  (mirrors abstract)

Axis 2 (methodology clarity / v3 residue):
- S III-G: added unit-bridge sentence distinguishing "descriptor-summary
  units" (signature/accountant) from "operational reporting units"
  (per-comparison/per-signature/per-document, S III-L)
- S III-H.2: "The calibration distinguishes two reference populations"
  -> "The supporting diagnostics use two reference populations" with
  explicit "neither is the calibration anchor"
- S III-L.1: "specificity" -> "ICCR refinement"
- S III-L.2: added "descriptive intuition, not an independence
  assumption used for estimation" caveat after the 1-(1-p)^n form

Axis 3 (no implicit signature-consistency assumptions):
- S III-F: hand-signing motivation rewritten as working hypothesis that
  "the classifier does not require ... to hold for all CPAs"
- S III-G A1: added "A1 does not assume temporal stability of
  handwriting or scanning workflow within or across years"
- S III-H.1: added label-caveat paragraph (operational rule outputs,
  not validated ground-truth classes); HC "strong replication evidence"
  -> "image-similarity evidence consistent with replication"; HSC
  "consistent with a CPA who signs very consistently" -> "mechanism not
  resolved by descriptor data alone"; LH explicitly owns that
  cross-year handwriting drift, scanner workflow change, or template
  variant rotation can also yield low max-cosine within a same-CPA pool
- S III-L.6 / S IV-M.6: "same-CPA repeatability signal" -> "observed
  same-CPA-pool excess ... not attributed to within-CPA handwriting
  repeatability"

Deferred (structural, not single-sentence patch): codex S III-I.2 /
S III-J K=2/K=3 deduplication; codex S III-K LOOO / S III-J duplication.
Both are MINOR stylistic redundancies, not reviewer-rejection risks.

DOCX rebuilt via export_v3.py; v4.0_20260515 file refreshed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-16 03:11:53 +08:00
parent 3672c9343e
commit becce857e1
8 changed files with 154 additions and 40 deletions
+4 -4
View File
@@ -30,13 +30,13 @@ The contributions of this paper are:
2. **End-to-end pipeline.** We present a pipeline that processes raw PDF audit reports through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor similarity computation, with automated inference and no manual intervention after initial training.
3. **Dual-descriptor similarity.** We demonstrate that combining deep-feature cosine similarity with independent-minimum dHash resolves the ambiguity between *style consistency* and *image reproduction*, and we support the backbone choice through a feature-backbone ablation.
3. **Dual-descriptor similarity.** We demonstrate that combining deep-feature cosine similarity with independent-minimum dHash provides complementary evidence for screening cases where *style consistency* and *image reproduction* hypotheses diverge, and we support the backbone choice through a feature-backbone ablation.
4. **Composition decomposition disproves the distributional-threshold path.** We show via a 2×2 factorial diagnostic (firm-mean centring × integer-tie jitter) that the apparent multimodality of the Big-4 accountant-level descriptor distribution is fully attributable to between-firm location shifts and integer mass-point artefacts. The descriptor distributions contain no within-population bimodal antimode; a distributional "natural threshold" reading of the operating points is not empirically supported.
4. **Composition decomposition does not support the distributional-threshold path.** We show via a 2×2 factorial diagnostic (firm-mean centring × integer-tie jitter) that the apparent multimodality of the Big-4 accountant-level descriptor distribution is fully attributable to between-firm location shifts and integer mass-point artefacts. The descriptor distributions contain no within-population bimodal antimode; a distributional "natural threshold" reading of the operating points is not empirically supported.
5. **Anchor-based multi-level inter-CPA coincidence-rate calibration.** We characterise the deployed five-way classifier at three units of analysis: per-comparison ICCR (cos$>0.95$: $0.0006$; dHash$\leq 5$: $0.0013$; joint: $0.00014$), pool-normalised per-signature ICCR ($0.11$ for the deployed any-pair high-confidence rule), and per-document ICCR ($0.34$ for the operational HC$+$MC alarm). We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for terminology that requires ground-truth negative labels, which the corpus does not provide.
5. **Anchor-based multi-level inter-CPA coincidence-rate calibration.** We characterise the deployed high-confidence (HC) sub-rule and document-level HC$+$MC alarm derived from the five-way classifier at three units of analysis: per-comparison ICCR (cos$>0.95$: $0.0006$; dHash$\leq 5$: $0.0013$; joint: $0.00014$), pool-normalised per-signature ICCR ($0.11$ for the deployed any-pair high-confidence rule), and per-document ICCR ($0.34$ for the operational HC$+$MC alarm). We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for terminology that requires ground-truth negative labels, which the corpus does not provide.
6. **Firm heterogeneity quantification and within-firm cross-CPA collision concentration.** Per-document D2 inter-CPA proxy ICCRs differ by an order of magnitude across firms (Firm A: $0.62$ versus Firms B/C/D: $0.09$$0.16$); a per-signature logistic regression of the any-pair HC hit indicator on firm dummies and centred log pool size confirms the firm gap persists after pool-size control. Cross-firm hit matrix analysis shows within-firm collision concentrations of $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D under the deployed any-pair rule (the stricter same-pair joint event saturates at $97.0$$99.96\%$ within-firm across all four firms); the pattern is consistent with firm-specific template, stamp, or document-production reuse mechanisms — a descriptive finding about deployed-rule behaviour, not a claim of deliberate template sharing.
6. **Firm heterogeneity quantification and within-firm cross-CPA collision concentration.** Per-document D2 inter-CPA proxy ICCRs differ by an order of magnitude across firms (Firm A: $0.62$ versus Firms B/C/D: $0.09$$0.16$); a per-signature logistic regression of the any-pair HC hit indicator on firm dummies and centred log pool size confirms the firm gap persists after pool-size control. Cross-firm hit matrix analysis shows within-firm collision concentrations of $98.8\%$ at Firm A and $76.7$$83.7\%$ at Firms B/C/D under the deployed any-pair rule (the stricter same-pair joint event saturates at $97.0$$99.96\%$ within-firm across all four firms); the pattern is consistent with — but does not independently establish — firm-level template-like reuse, digitisation-pipeline homogeneity, or signing-style homogeneity, which descriptor-only data cannot separate (§V-H).
7. **K=3 as descriptive firm-compositional partition; three-score convergent internal consistency.** We fit a K=3 Gaussian mixture as a descriptive partition of the Big-4 accountant-level distribution (interpreted as firm-compositional structure, not as three mechanism clusters). Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$; we report this as internal consistency rather than external validation, given that the scores share the underlying descriptor pair.