Apply codex round-26 corrections to Phase 4 prose v2

Codex round 26 returned Major Revision on Phase 4 v1: 9 Major findings + 12 Minor + reviewer-attack vulnerabilities. v2 applies all flagged corrections.

Abstract changes:
- "Three independent feature-derived scores" -> "Three feature-derived scores ... not statistically independent because all three are functions of the same descriptor pair". Names the operational output as the inherited five-way classifier.
- Trimmed from 277 to ~245 words to stay within IEEE Access 250-word limit while keeping all numerical anchors.

§I Introduction:
- Line 29 cross-ref §III-D -> §III-G through §III-J (§III-D was wrong; the methodology lives in §III-G/I/J).
- Big-4 scope claim narrowed: "neither any single firm pooled alone nor the broader full-dataset variant rejects" -> "none of the narrower comparison scopes tested in Script 32 rejects" with explicit enumeration (Firm A pooled alone; Firms B+C+D pooled; all non-Firm-A pooled).
- "Three independent feature-derived scores" -> "Three feature-derived scores ... not statistically independent".
- Contribution 4 "not at narrower scopes" -> "not in the narrower comparison scopes tested".
- Contribution 8 "demonstrating pipeline reproducibility at multiple scopes" -> narrowed to "K=3 + box-rule rank-convergence reproduces at full n=686; does not re-validate operational thresholds / LOOO / five-way / pixel identity at the broader scope".
- "external validation" softened to "annotation-free validation" in methodological-safeguards paragraph.
- "(5)–(8)" pipeline stage list updated with corrected section references.
- "Published box rule" -> "inherited Paper A box rule".
- Added Big-4 pixel-identity per-firm breakdown (145/8/107/2) in §I body for completeness.

§II Related Work:
- Replaced placeholder with explicit defer-to-master statement: v3.20.0 §II is inherited substantively unchanged in the master manuscript; only the LOOO addition is reproduced here.
- "[add citation]" replaced with placeholder references [42] Stone 1974, [43] Geisser 1975, [44] Vehtari et al. 2017 explicitly marked as draft references to be finalised at copy-edit time.
- LOOO addition reframed: composition-sensitivity band on the mixture characterisation, not on the operational classifier.

§V Discussion:
- §V-B "v4.0 inherits and confirms" softened to "v4.0 inherits this signature-level reading and remains consistent with it (no signature-level diagnostic was newly run in v4)".
- §V-B "some CPAs are templated, some are hand-leaning, some are mixed" rewritten as component-membership wording: "some CPAs' observed signatures place their per-CPA means in the templated/mixed/hand-leaning region of the descriptor plane".
- §V-B within-CPA unimodality explanation softened from "produces" to "can be jointly consistent" with explicit §III-G cross-ref.
- §V-C Firm A byte-level provenance: 145 pixel-identical signatures verified in Script 40; 50 partners / 35 cross-year explicitly inherited from v3 / Script 28 not regenerated in v4 spikes.
- §V-C "anchors §IV-H's positive-anchor miss-rate" -> "is the largest of the four Big-4 subsets, with full anchor pooling Firm A 145, Firm B 8, Firm C 107, Firm D 2".
- §V-E "published box rule" -> "inherited Paper A box rule"; "produce the same per-CPA ranking" -> "broadly concordant rankings, with residual non-Firm-A disagreement".
- §V-G limitations expanded from 7 to 12 items: restored the 5 v3.20.0 inherited limitations (transferred ImageNet features, HSV stamp-removal artifacts, longitudinal scan confounds, source-exemplar misattribution, legal interpretation).
- §V-G scope limitation: removed unsupported "narrower or broader scopes" full-dataset dip-test claim.

§VI Conclusion:
- Names operational output: "inherited Paper A five-way per-signature classifier with worst-case document-level aggregation".
- "Cross-scope pipeline reproducibility" -> "K=3 + box-rule rank-convergence reproduces at full n=686; does not re-validate operational thresholds, LOOO, five-way classifier, or pixel-identity at the broader scope".
- Future-work direction 3 explicitly qualifies the within-Big-4 contrast as "accountant-level descriptive features of the K=3 mixture, not validated mechanism-level claims and not currently linked to audit-quality outcomes".

Round 26 closure post-v2:
- All 9 Major findings: CLOSED in v2 prose body.
- All 12 Minor findings: CLOSED in v2 prose body.
- Phase 5 readiness: should now move from Partial to Yes pending codex round 27 verification.

Provenance: codex round-26 confirmed 17/17 numerical claims in Phase 4 v1 (only finding #5, the scope-test wording, was an overclaim rather than a numerical error). v2 keeps all confirmed numerics and narrows only the scope-test wording.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 23:50:09 +08:00
parent e36c49d2d8
commit 10c82fd446
2 changed files with 194 additions and 29 deletions
@@ -0,0 +1,157 @@
+# Paper A Round 26 Review - v4 round 6
+
+Reviewer: gpt-5.5 xhigh  
+Date: 2026-05-12  
+Target: `paper/v4/paper_a_prose_v4_phase4.md` (Phase 4 prose draft v1)  
+Foundation checked: `paper/v4/paper_a_methodology_v4_section_iii.md` (§III v6) and `paper/v4/paper_a_results_v4_section_iv.md` (§IV v3.2)  
+Trajectory checked: rounds 21-25 plus v3.20.0 Abstract / §I / §II / §V / §VI baselines
+
+## Verdict
+
+Major Revision.
+
+The technical core in §III v6 and §IV v3.2 is stable, but the new Phase 4 prose introduces several reviewer-visible regressions. The most important are: (i) the Abstract and Introduction revive the "independent scores" overclaim even though §III/§IV repeatedly say the three scores are not statistically independent; (ii) §I and §V overstate the Big-4 scope evidence by claiming unsupported single-firm and full-dataset dip-test non-rejections; (iii) §II is still a placeholder with `[add citation]`, not a submission-ready related-work section; and (iv) §V-G drops several inherited limitations from v3.20.0.
+
+## Section-By-Section Findings
+
+### Abstract
+
+1. **Major - line 11: "Three independent feature-derived scores" contradicts the converged methodology.** §III-K states that the three scores are "not statistically independent measurements" because all are deterministic functions of the same descriptor means (§III:90), and §IV-F repeats the caveat (§IV:79). The Abstract should say "three feature-derived scores" or "three non-identical feature-derived summaries" and, if space allows, add the shared-feature caveat.
+
+2. **Minor - line 11: "candidate classifiers" can be read as operational-classifier language.** One of the three "candidate classifiers" is the K=3 per-CPA hard label, which §III-J/§III-L explicitly demotes to descriptive characterisation, not operational signature/document classification (§III:64, §III:156). Use "candidate rules/scores" or explicitly reserve "operational classifier" for the inherited five-way box rule.
+
+3. **Minor - line 11: the Abstract passes IEEE Access form but has no margin.** It is one paragraph and `wc -w` counts 247 words, so it satisfies the <=250-word target. Any added caveat will require trimming elsewhere.
+
+4. **Minor - line 11: the Abstract does not name the primary operational output.** The abstract describes the pipeline and the K=3 / convergence / anchor checks, but it does not state that the primary operational output remains the inherited five-way per-signature classifier with worst-case document aggregation (§III-L; §IV-J). This omission makes the K=3 and reverse-anchor checks look more central operationally than §III/§IV allow.
+
+### §I Introduction
+
+1. **Major - line 31: the Big-4 scope claim is overbroad and partly unsupported.** The sentence says "neither any single firm pooled alone nor the broader full-dataset variant rejects unimodality." §III and §IV only report comparison dip tests for Firm A alone, Firms B+C+D pooled, and all non-Firm-A pooled (§III:34, §III:56; §IV:27-34). They explicitly state that single-firm dip tests for Firms B, C, and D were not separately computed (§III:34, §III:56; §IV:34). §IV-K is a light full-dataset K=3 + Spearman robustness check and does not report a full-dataset dip test (§IV:230-252). Rewrite this as "no narrower comparison scope tested in Script 32..." and remove the full-dataset dip-test claim unless a spike report is added.
+
+2. **Major - line 29: the section cross-reference for accountant-level distributional characterisation is wrong.** The prose points to "§III-D" for the Big-4 accountant-level distributional characterisation. In the converged methodology, this material is §III-G through §III-J, especially §III-I and §III-J (§III:18-86). §IV-D/§IV-E are correct.
+
+3. **Major - line 35: the Introduction repeats the "independent feature-derived scores" error.** The next sentence correctly says the scores are not statistically independent, but the opening clause still hands reviewers an avoidable contradiction. This was a central round-21/22 issue and should not reappear in the front matter.
+
+4. **Minor - line 47: contribution 4 again overstates "not at narrower scopes."** The defensible phrase is "not in the narrower comparison scopes tested" because B/C/D single-firm dip tests were not computed.
+
+5. **Minor - line 55: contribution 8 overclaims the full-dataset check.** §IV-K deliberately re-runs only K=3 + Paper A box-rule Spearman convergence at full `n = 686`; it does not re-run LOOO, five-way moderate-band validation, or operational threshold calibration (§IV:230). "Pipeline reproducibility at multiple scopes" should be narrowed to "the K=3 + box-rule rank-convergence check reproduces at the full-CPA scope."
+
+6. **Minor - line 25: the methodological safeguards paragraph uses "external validation" too broadly.** The pixel-identity anchor is a conservative positive-subset check, the inter-CPA FAR is inherited corpus-wide, and LOOO is descriptive composition-sensitivity evidence. The paragraph should avoid implying full external validation of the operational classifier.
+
+### §II Related Work
+
+1. **Major - lines 63-65: §II is not submission-ready prose if inserted as written.** The section says v3.20.0 §II is retained "without substantive change," but the target Phase 4 file is supposed to replace the §II block. As written, it is a meta-summary rather than an actual Related Work section. Either the master manuscript must keep the full v3.20.0 §II text and splice in the LOOO paragraph, or this file must contain the full revised §II.
+
+2. **Major - line 67: unresolved citation placeholder.** "`[add citation]`" is still present. This must be replaced before Phase 5; otherwise a reviewer can attack the only new Related Work content as uncited.
+
+3. **Minor - line 67: "calibration uncertainty band on the operational rule" conflicts with the converged classifier framing.** §III-J says neither K=2 nor K=3 is used as an operational classifier (§III:64), and §III-L reserves operational classification for the inherited five-way box rule (§III:138-156). If the LOOO paragraph is about K=2/K=3 mixture fits, call it a composition-sensitivity or calibration-uncertainty check on the candidate mixture boundary/characterisation, not on "the operational rule."
+
+### §V Discussion
+
+1. **Major - line 81: the prose reifies mechanism labels at the CPA level.** "Some CPAs are templated, some are hand-leaning, some are mixed" is stronger than §III allows. §III-G says a per-CPA mean is a summary statistic, not a claim that all signatures for that CPA share a mechanism (§III:22). Use component-membership wording: "some CPAs' observed signatures place their per-CPA means in the templated, mixed, or hand-leaning regions."
+
+2. **Major - line 81: the within-CPA unimodality explanation is speculative.** The claim that occasional template reuse "produces a unimodal per-signature distribution within the CPA but a multimodal per-CPA distribution across CPAs" is not directly tested in §III/§IV. v3.x tested Firm A and all-CPA signature-level distributions, and v4.0 adds per-signature K=3 consistency (§IV-F), but there is no per-CPA distributional test for individual CPAs.
+
+3. **Major - lines 103-119: limitations are incomplete relative to v3.20.0 and the inherited pipeline.** The v4 limitations keep the Big-4 scope, missing hand-signed ground truth, pixel-identity subset, inherited-rule, A1, K=3 composition, and no-intent caveats. They drop v3 limitations that still apply: ImageNet-pretrained ResNet-50 without signature-domain fine-tuning (v3 §V:90-92), HSV red-stamp removal artifacts (v3 §V:93-95), longitudinal scanning/PDF/compression confounds (v3 §V:97-99), source-exemplar misattribution in max/min pair logic (v3 §V:100-102), and legal/regulatory interpretation limits (v3 §V:108-109). If these are intentionally retired, the draft needs a reason; otherwise they should be restored.
+
+4. **Major - line 107: the scope limitation repeats the unsupported full-dataset dip-test implication.** The sentence says dip-test multimodality is "not available at narrower or broader scopes." §III/§IV do not report full-dataset dip-test results; §IV-K is explicitly a light Spearman robustness check (§IV:230-252). Keep the LOOO broader-scope caveat, but do not claim full-dataset dip-test non-availability without evidence.
+
+5. **Minor - line 79: "v4.0 inherits and confirms" is too strong for the per-signature continuous-spectrum reading.** The exact v3 per-signature diagnostic package is inherited; v4.0's new per-signature evidence is mostly the K=3 consistency check (§IV-F) and five-way output (§IV-J). Safer: "v4.0 inherits this signature-level reading and remains consistent with it."
+
+6. **Minor - line 85: inherited Firm A byte-level details need provenance language.** The 145 Firm A pixel-identical signatures are verified in Script 40, but the "50 distinct partners" and "35 cross-year" details are explicitly inherited from v3 / Script 28 and not regenerated in v4.0 (§III:44, §III:190). The discussion should mark that provenance, especially because the spike reports provided for v4 only verify the 145 count.
+
+7. **Minor - line 87: Firm A does not alone anchor §IV-H.** §IV-H's positive-anchor subset is all Big-4 byte-identical signatures, `n = 262`, split 145 / 8 / 107 / 2 across Firms A-D (§IV:145-153). Firm A is the largest subset and the case-study evidence, but not the whole anchor.
+
+8. **Minor - line 97: "published box rule" is not traceable.** §III/§IV call this the inherited Paper A / v3.x box rule, not a published external rule (§III:96, §III:138; §IV:85-87). Use "inherited box rule" unless there is a publication citation.
+
+9. **Minor - line 97: "produce the same per-CPA ranking" is stronger than the evidence.** The scores are highly correlated, but §III/§IV note a residual non-Firm-A disagreement: reverse-anchor ranks Firm D fractionally above Firm C while P(C1) and box-rule hand-leaning rate rank Firm C highest (§III:106; §IV:102). Say "broadly concordant ranking."
+
+10. **Minor - line 101: "candidate classifiers" again blurs operational status.** K=3 hard labels remain descriptive. This can be fixed together with the Abstract wording.
+
+### §VI Conclusion And Future Work
+
+1. **Major - line 127: "cross-scope pipeline reproducibility" overstates §IV-K.** The full-dataset result verifies only that K=3 P(C1) and Paper A hand-leaning-rate Spearman convergence remains high at `n = 686` with drift `0.0069` (§IV:242-250; full-dataset report:25-31). It does not reproduce the pipeline, the five-way classifier, the moderate-confidence band, LOOO, or operational thresholds at full scope.
+
+2. **Minor - line 129: the future-work audit-quality contrast must stay explicitly descriptive.** "Firm A's 82% templated concentration vs Firm C's 23.5% hand-leaning concentration" comes from K=3 hard-posterior accountant-level assignment (§IV:215-224), whose membership is composition-sensitive (§IV:129-139). The future-work sentence is acceptable if it says these are descriptive component concentrations and that current Paper A provides no audit-quality correlation evidence.
+
+3. **Minor - lines 125-127: the conclusion underplays the actual operational output.** It names the pipeline and methodological checks, but it does not mention the inherited five-way per-signature/document-level classifier that §III-L and §IV-J define as the operational output. This is not a numerical error, but it leaves the operational-vs-descriptive distinction less clear at closure.
+
+## Reviewer-Attack Vulnerabilities Specific To The Prose
+
+1. A reviewer can quote line 11 or line 35 ("independent feature-derived scores") against §III-K/§IV-F's non-independence caveat and argue that the paper exaggerates validation strength.
+
+2. A reviewer can attack the Big-4 scope claim because the prose says "any single firm" and "full-dataset variant" even though B/C/D single-firm dip tests and full-dataset dip tests are not reported.
+
+3. The current §II can be rejected as incomplete because it is a placeholder, not a related-work section, and includes `[add citation]`.
+
+4. "Published box rule" invites a citation challenge. The body only supports "inherited Paper A / v3.x box rule."
+
+5. The discussion sometimes turns descriptive component labels into apparent mechanism claims about CPAs. This conflicts with the §III-G rule that per-CPA means are summaries, not partner-level mechanism assignments.
+
+6. The phrase "candidate classifiers" for K=3 and reverse-anchor checks can be read as walking back the round-21 convergence that K=3 is descriptive and the five-way box rule is operational.
+
+7. The limitations section is vulnerable because it drops inherited limitations that still apply to the pipeline: feature backbone transfer, red-stamp preprocessing, longitudinal document-generation shifts, source-exemplar misattribution, and legal interpretation limits.
+
+8. The full-dataset robustness claim is easy to overread. §IV-K is intentionally "light scope"; calling it pipeline reproducibility or cross-scope operational reproducibility exceeds the evidence.
+
+## Provenance Verification Table
+
+| # | Phase 4 numerical claim | Phase line(s) | Provenance checked | Status |
+|---:|---|---:|---|---|
+| 1 | Abstract is <=250 words | 11 | `sed -n '11p' ... \| wc -w` returned 247 | Confirmed, but close to limit |
+| 2 | 90,282 reports, 182,328 signatures, 758 CPAs | 11, 37, 125 | §IV:7 gives 90,282 PDFs; §IV:13 gives 182,328 extracted signatures; v3 §I:62 gives 758 CPAs | Confirmed with inherited full-corpus CPA source |
+| 3 | Big-4 sub-corpus: 437 CPAs, 150,442 signatures | 11, 37, 125 | §III:30; §IV:9, §IV:15; five-way report:14-15 | Confirmed |
+| 4 | Big-4 dip-test multimodality, `p < 5 x 10^-4` on both axes | 11, 31, 81, 127 | §III:34, §III:56, §III:171-172; §IV:27-34 | Confirmed for Big-4 |
+| 5 | "Neither any single firm pooled alone nor broader full-dataset variant rejects" | 31 | §III:34/56 and §IV:34 say only Firm A alone was tested among single firms; §IV-K has no full-dataset dip test | Not verified / overclaimed |
+| 6 | K=2 crossings `cos*=0.9755`, `dHash*=3.755`, cosine CI half-width 0.0015 | 31 | calibration report:16-17; §III:58, §III:166-170; §IV:60-63 | Confirmed |
+| 7 | K=2 LOOO max cosine-crossing deviation `0.028`, `5.6x` tolerance, Firm A held-out 100% vs non-A 0% | 31, 91 | calibration report:34-44; §III:78, §III:120; §IV:122-127 | Confirmed, with 0.0278 rounded to 0.028 |
+| 8 | K=3 components: C3 `0.983/2.41/0.321`, C2 `0.956/6.66/0.536`, C1 `0.946/9.17/0.143` | 33 | k3 LOOO report:8-10; convergence report:8-12; §III:70-76; §IV:69-75 | Confirmed after rounding |
+| 9 | K=3 C1 LOOO shape drift: cos <=0.005, dHash <=0.96, weight <=0.023 | 11, 33, 93, 127 | k3 LOOO report:77-79; §III:78, §III:122; §IV:139 | Confirmed |
+| 10 | K=3 held-out hard-posterior differences `1.8-12.8 pp` | 33, 93, 117 | k3 LOOO report:83-90; §III:122; §IV:134-139 | Confirmed after rounding |
+| 11 | Three-score Spearman convergence `rho >= 0.879` | 11, 35, 51, 97, 127 | convergence report:28-30; §III:100-104; §IV:83-87 | Confirmed numerically; wording must not say independent |
+| 12 | Per-signature K=3 consistency `Cohen kappa = 0.87` | 97 | §III:108-116; §IV:104-112 | Confirmed |
+| 13 | Pixel-identity subset `n = 262`, all three checks 0% miss, Wilson upper 1.45% | 11, 35, 53, 101, 127 | pixel-identity report:8, 14-16; §III:124-132; §IV:145-153 | Confirmed |
+| 14 | Firm A pixel-identical `145`, plus `50 partners` and `35 cross-year` | 85 | pixel-identity report:24 confirms 145; §III:44 and §III:190 mark 50/35 as inherited from v3 / Script 28, not regenerated in v4 spikes | Partially confirmed; provenance caveat needed |
+| 15 | Inter-CPA FAR `0.0005`, Wilson `[0.0003, 0.0007]` | 53, 101 | §III:188; §IV:157-159; inherited v3.20.0 §IV-F.1 Table X | Confirmed as inherited |
+| 16 | Full-dataset robustness `n = 686`, full rho `0.9558`, drift `0.007` | 11, 55, 107, 127 | full-dataset report:10-13, 25-31; §III:186; §IV:242-250 | Confirmed numerically, but interpretive scope is light |
+| 17 | Firm A `82%/82.5%` templated and Firm C `23.5%` hand-leaning | 85, 129 | convergence report:43-48; §IV:217-224 | Confirmed as descriptive K=3 hard assignment |
+
+## Cross-Reference Checks (Phase 4 <-> §III v6 / §IV v3.2)
+
+| Linkage | Phase 4 evidence | §III / §IV evidence | Status |
+|---|---:|---:|---|
+| Big-4 primary scope and sample size | Lines 11, 31, 37, 107, 125 | §III:30; §IV:9, §IV:15 | Numerically tight, but scope-test wording overbroad |
+| Accountant-level distributional characterisation refs | Line 29 | §III-I/J are the relevant methodology sections (§III:52-86); §IV-D/E correct (§IV:21-75) | Fail: `§III-D` is stale/wrong |
+| K=2 as firm-mass separator, not operational | Lines 31, 91 | §III:78-86, §III:120; §IV:118-127 | Tight |
+| K=3 descriptive only | Lines 33, 49, 93 | §III:64, §III:80-86, §III:156; §IV:75, §IV:139, §IV:224 | Tight, except "candidate classifier" wording |
+| Three-score internal consistency | Lines 11, 35, 51, 97, 127 | §III:90-106; §IV:79-102 | Numerically tight; independence wording fails |
+| Reverse-anchor reference as non-Big-4 | Lines 35, 97 | §III:48-50; §IV:89 | Tight |
+| Pixel-identity positive anchor | Lines 35, 101 | §III:124-134; §IV:141-155 | Tight; Firm A-only anchoring phrase should be narrowed |
+| Inter-CPA negative-anchor FAR | Lines 53, 101 | §III:126, §III:188; §IV:157-159 | Tight as inherited |
+| Five-way classifier primary / MC band inherited | Lines 33, 113 | §III:136-156; §IV:161-224 | Mostly tight; Abstract/Conclusion should name operational output more clearly |
+| Full-dataset robustness | Lines 55, 107, 127 | §IV:228-252 | Numerically tight; "pipeline reproducibility" overclaims light scope |
+| Internal notes and close-out artifacts | Lines 3, 133-142 | Round-25 review kept this open; §III and §IV also retain internal notes | Not partner/Phase-5 ready |
+
+## Phase 5 Readiness
+
+Partial.
+
+The §III/§IV technical foundation would likely survive cross-AI peer review, but the current Phase 4 prose would draw a Major Revision because it reintroduces known overclaims and has an incomplete §II. With the targeted prose repairs below, Phase 5 readiness should move to Yes.
+
+## Recommended Next-Step Actions
+
+1. Replace every "independent feature-derived scores" phrase with "three feature-derived scores" or "three feature-derived summaries," and preserve the shared-feature caveat in Abstract/§I/§V/§VI.
+
+2. Rewrite the Big-4 scope language at lines 31, 47, 81, 107, and 127 to match §III exactly: Big-4 is the smallest scope among the comparison scopes tested; B/C/D single-firm dip tests were not computed; no full-dataset dip-test result is reported.
+
+3. Fix stale cross-references in line 29: use §III-G/I/J/K as appropriate instead of §III-D.
+
+4. Turn §II into a real revised Related Work section: retain the v3.20.0 subsections in the master, splice in the LOOO paragraph, and replace `[add citation]` with a specific cross-validation citation.
+
+5. Rebuild §V-G limitations by merging the v4-specific limitations with still-valid v3 limitations: transferred ResNet-50 features, HSV stamp-removal artifacts, longitudinal scan/PDF confounds, source-exemplar misattribution, and legal/regulatory interpretation.
+
+6. Replace "published box rule" with "inherited Paper A box rule" unless an external publication citation is added.
+
+7. Narrow full-dataset language: say "K=3 + box-rule rank-convergence reproduces at full `n = 686`" rather than "pipeline reproducibility at multiple scopes."
+
+8. Before Phase 5, strip the Phase 4 draft note and close-out checklist (lines 3 and 133-142), and continue the same cleanup for §III/§IV internal notes flagged in round 25.
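
Row 13 of the provenance table in the review above quotes a Wilson upper bound of 1.45% for 0 misses in the `n = 262` pixel-identity subset. As a sanity check, that figure can be reproduced with the standard Wilson score interval. The sketch below is only an illustration: the review does not state the confidence level or the exact routine used in the pixel-identity report, so the two-sided 95% level (`z ≈ 1.96`) and the helper name `wilson_interval` are assumptions.

```python
import math


def wilson_interval(events: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: two-sided 95% level (z ~= 1.96); the review does not
    state which level the pixel-identity report used.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= events <= n:
        raise ValueError("events must lie in [0, n]")

    p_hat = events / n
    z2 = z * z
    denom = 1.0 + z2 / n
    centre = (p_hat + z2 / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z2 / (4.0 * n * n)) / denom

    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# 0 misses among the 262 Big-4 byte-identical signatures (row 13).
lower, upper = wilson_interval(0, 262)
print(f"Wilson 95% upper bound on the miss rate: {upper:.2%}")
```

With zero observed events the interval collapses to `[0, z² / (n + z²)]`, so the upper bound depends only on `n` and the chosen `z`. Under the 95% assumption this comes to about 1.45%, matching the figure the review marks as confirmed.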