Apply codex round-22 corrections to §III v3 (Minor -> ready)

Codex gpt-5.5 round 22 returned Minor Revision after v2 closed 3/5 Major findings fully and 2/5 partially. Five narrow fixes applied for v3: 1. Per-firm ranking unanimity corrected (v2:93). The reverse- anchor score ranks Firm D fractionally higher than Firm C (-0.7125 vs -0.7672); only Scores 1 and 3 rank Firm C highest. The unanimity claim was wrong; v3 prose now says all three agree on Firm A as most replication-dominated and on the non-Firm-A Big-4 as more hand-leaning, with a modest disagreement on Firm C vs D ordering. 2. "Smallest scope" / "any single firm" overclaim narrowed (v2:21, v2:43). Script 32 only tested Firm A alone, big4_non_A pooled, and all_non_A pooled -- not Firms B, C, D individually. v3 explicitly says "comparison scopes tested in Script 32" and notes single-firm dip tests for B, C, D were not separately computed. 3. K=3 hard label vs posterior in Spearman correctly attributed (v2:143). Script 38 uses the K=3 posterior P(C1), not the hard label, in the internal-consistency Spearman correlations. v3 §III-L now correctly says the hard label is for the §IV cluster cross-tabulation while the posterior is the continuous Score 1 in §III-K. 4. Provenance source for n=150,442 corrected (v2:17, v2:152). Script 39 directly reports this count in its per-signature K=3 fit; Script 38's report does not. v3 cites Script 39 for this number. 5. "Max fold-to-fold deviation" wording made precise (v2:65, v2:107). The $0.028$ value is the max absolute deviation from the across-fold mean (Script 36 stability summary), not the pairwise across-fold range (which is $0.0376 = 0.9756 - 0.9380$). v3 reports both statistics with explicit definitions. Also: draft note at top now records v2 (round-21) and v3 (round-22) revision lineage. Cross-reference index and open- question block retained as author working checklist (will be removed before manuscript submission per codex e7). Outstanding open questions reduced to 3 (codex round-22 view): - Five-way moderate-confidence band: validate in Big-4 specifically (Phase 3 §IV-F work) or document as inherited from v3.x? - Firm anonymisation policy in §IV-V (procedural) - §IV table numbering (procedural; defer until §IV done) Phase 2 §III draft is now Minor-Revision-quality. Ready for Phase 3 (Results regeneration §IV). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 16:26:02 +08:00
parent 62a22ceb83
commit c8c7656513
2 changed files with 103 additions and 10 deletions
@@ -0,0 +1,87 @@
+# Paper A v4.0 Methodology Section III-G through III-L Peer Review
+
+Reviewer: gpt-5.5 xhigh  
+Date: 2026-05-12  
+Round number: 22 (v4 round 2)  
+Review target: `paper/v4/paper_a_methodology_v4_section_iii.md`
+
+## Verdict
+
+Minor Revision.
+
+v2 closes most of the round-21 blockers: K=3 is no longer the operational classifier, the "independent lenses" claim is softened, the pixel-identity metric is no longer called FAR in the draft, and the main empirical slips are corrected. The remaining issues are narrower but still need edits before accepting the methodology text, especially the false per-firm ordering claim in §III-K and the unresolved validation status of the five-way moderate-confidence band.
+
+## Round-21 finding closure table
+
+| Finding | Round-21 Severity | v2 Status | Evidence in v2 |
+|---|---|---|---|
+| M1. K=3 is not justified as an operational classifier. | Major | CLOSED | v2 explicitly says both K=2 and K=3 are descriptive and not used for signature/document labels (v2:51, v2:67-73, v2:143). It also reports Script 37 `P2_PARTIAL` and the "not predictively useful as an operational classifier" implication (v2:65, v2:109). |
+| M2. "Three independent lenses" overstates independence and validation strength, and reverse-anchor direction was wrong. | Major | PARTIAL | The independence and reverse-anchor wording are fixed: the scores are "not statistically independent" and only internal-consistency checks (v2:75-83), and the reference is now described as less replication-dominated (v2:35-37). However, v2 adds a false per-firm ordering claim that all three scores make Firm C most hand-leaning (v2:93); Script 38's reverse-anchor mean instead ranks Firm D highest. |
+| M3. Classifier conflation; only the simplified binary rule was validated. | Major | PARTIAL | v2 now declares the inherited five-way box rule as primary (v2:123-143) and K=3 as descriptive (v2:143). It also correctly notes that the kappa comparison validates only the binary high-confidence rule, not the five-way moderate band (v2:103). The unresolved moderate-band validation is still open (v2:190-192), and v2:125 still uses binary-rule correlations to support the full five-way rule without recalibration. |
+| M4. Pixel-identity "FAR" naming and evidentiary force were wrong. | Major | CLOSED | v2 renames this to a positive-anchor miss rate, frames it as a one-sided replicated-positive check, and adds the tautology/conservative-subset caveat (v2:111-121). |
+| M5. Empirical/provenance claims needed correction or explicit unverified status. | Major | CLOSED | The 0.005 denominator is now a stability tolerance, not a bootstrap CI (v2:65, v2:107); all-non-Firm-A dip values are corrected (v2:21, v2:43); BD/McCrary is narrowed to Big-4 null with external dHash transitions disclosed (v2:47); Firm A byte-decomposition details are marked inherited/not regenerated (v2:31, v2:176); "tail distorts" is softened to a scope-dependent shift (v2:19). |
+| m1. Dip-test p-value precision needed bootstrap-resolution wording. | Minor | CLOSED | v2 states no bootstrap replicate exceeded the observed statistic and reports `p < 5 x 10^-4` for `n_boot = 2000` (v2:21, v2:43, v2:158-159). |
+| m2. Delta BIC sign convention was confusing. | Minor | CLOSED | v2 defines lower BIC as preferred and reports `BIC(K=3) - BIC(K=2) = -3.48`, plus "K=3 lower by 3.48" (v2:45, v2:63). |
+| m3. Per-signature convergence is only moderate for the box rule. | Minor | CLOSED | v2 includes the `SIG_CONVERGENCE_MODERATE` verdict and avoids calling the Paper A-vs-K=3 kappas strong (v2:95-103). |
+| m4. Per-CPA vs per-signature component centers drift more than v1 suggested. | Minor | CLOSED | v2 says the fits recover a "broadly similar three-component ordering" and reports the C1 cosine drift of 0.018 (v2:95). |
+| m5. Section III-L title was misleading. | Minor | CLOSED | The section is now titled "Signature- and Document-Level Classification" and separates per-signature categories from document aggregation (v2:123-143). |
+| m6. K=3 alternative lacked document aggregation. | Minor | CLOSED | v2 no longer offers K=3 as a signature/document classifier, so a K=3 document aggregation rule is no longer required (v2:143). |
+| m7. Firm anonymization was inconsistent. | Minor | CLOSED | v2 uses Firm A-D pseudonyms in the methodology text and no longer names the Big-4 firms directly in the prose (v2:17, v2:31, v2:194). |
+| e1. Replace "more-replicated-population baseline." | Editorial | CLOSED | v2 now calls non-Big-4 a less-replicated external/reverse-anchor reference (v2:35-37). |
+| e2. Replace "failure rate" for Lens 3. | Editorial | CLOSED | Lens 3 is now "Paper A box-rule hand-leaning rate" (v2:83). |
+| e3. "Strongest single methodology-validation signal" was too strong. | Editorial | CLOSED | v2 uses "strongest internal-consistency signal" and denies external validation (v2:77, v2:93). |
+| e4. "Boundary moves modestly" understated LOOO membership instability. | Editorial | CLOSED | v2 uses composition-sensitive wording and reports the 12.8 pp Firm C fold deviation (v2:65, v2:109). |
+| e5. "Calibration uncertainty band of +/- 5-13 pp" wording needed correction. | Editorial | CLOSED | v2 reports observed absolute differences of 1.8-12.8 pp and the 5 pp viability bar (v2:109). |
+| e6. "Operational threshold derivation" language was inaccurate. | Editorial | CLOSED | v2 consistently calls K=3 a mixture characterisation/descriptive model, not an operational threshold source (v2:49-73, v2:143). |
+| e7. Cross-reference index should be removed or made internal. | Editorial | PARTIAL | v2 labels the cross-reference index as an author checklist to remove before submission (v2:181), but it remains inside the methodology draft (v2:181-188). |
+
+## Newly introduced issues
+
+1. **New factual/provenance error: the three scores do not agree on the most hand-leaning firm.** v2 claims that "by all three scores, Firm A is the most replication-dominated and Firm C is the most hand-leaning" (v2:93). Script 38 confirms Firm A is most replication-dominated, but not the Firm C part for all scores: mean P_C1 and mean hand_frac rank Firm C highest, while mean reverse-anchor ranks Firm D highest (`-0.7125` vs Firm C `-0.7672`, with higher score meaning more hand-leaning). Revise to: "P_C1 and box-rule hand_frac rank Firm C highest; the reverse-anchor score ranks Firm D highest; all three agree Firm A is most replication-dominated and the non-A firms are more hand-leaning than Firm A."
+
+2. **Unsupported scope superlative: "any single firm" / "smallest scope" is not proven by the supplied reports.** v2 says no dip-test rejection holds "within any single firm pooled alone" and that Big-4 is the "smallest scope" supporting a finite-mixture model (v2:21; repeated more generally at v2:43). The supplied Script 32 report verifies Firm A alone, `big4_non_A`, and `all_non_A`; it does not report separate single-firm tests for Firms B, C, and D or all smaller combinations. Narrow this to "among the tested comparison scopes in Script 32" or add the missing single-firm tests.
+
+3. **K=3 hard labels are incorrectly described as used in the Spearman correlations.** v2:143 says the "K=3 hard label" is used for the internal-consistency Spearman correlations. Script 38's Spearman table uses the K=3 posterior score `P_C1`, not hard labels. Change v2:143 to "K=3 posterior score is used for the Spearman correlations; hard labels are used for the cluster cross-tabulation."
+
+4. **Provenance table over-cites Script 38 for the Big-4 signature count.** v2:17 and v2:152 attribute the 150,442 signature count partly/directly to Script 38. In the supplied markdown report, Script 39 directly reports the 150,442 signature-level cloud; Script 38's visible report does not directly state that count. Keep Script 39 as the direct source unless the JSON artifact is also cited.
+
+5. **"Max fold-to-fold deviation" wording is imprecise.** v2 reports a K=2 "max fold-to-fold deviation" of 0.028 (v2:65, v2:107). Script 36's 0.0278 is the max absolute deviation across folds as reported in the stability summary, not the pairwise fold range; the fold cut range is about 0.0376 (0.9756 - 0.9380). Use the report's exact wording or explicitly define the statistic.
+
+## Provenance re-verification
+
+| v2 numerical claim | v2 lines | Spike-report check | Status |
+|---|---:|---|---|
+| Big-4 has 437 CPAs split 171 / 112 / 102 / 52. | v2:17, v2:151 | Script 36 reports 437 CPAs; Script 34 reports the four firm counts. | CONFIRMED |
+| Big-4 signature-level cloud has 150,442 signatures. | v2:17, v2:95, v2:152 | Script 39 reports fitting on 150,442 signature-level points. | CONFIRMED, but source should be Script 39 rather than Script 38 in the provenance table. |
+| Big-4 K=2 crossings are cos 0.9755 and dHash 3.7549, with CIs [0.9742, 0.9772] and [3.4762, 3.9689]. | v2:45, v2:53, v2:154-156 | Script 36 and Script 34 report these point estimates and bootstrap CIs. | CONFIRMED |
+| K=3 components are C1 0.9457/9.1715/0.143, C2 0.9558/6.6603/0.536, C3 0.9826/2.4137/0.321. | v2:55-63, v2:163 | Scripts 35, 37, and 38 report the same centers and weights. | CONFIRMED |
+| K=3 LOOO membership deviations are 1.8-12.8 pp, with `P2_PARTIAL`. | v2:65, v2:109, v2:168 | Script 37 reports diffs 1.76, 4.68, 5.81, 12.77 pp and verdict `P2_PARTIAL`. | CONFIRMED |
+| Spearman correlations are 0.963, 0.889, and 0.879. | v2:85-91, v2:169 | Script 38 reports 0.9627, 0.8890, and 0.8794. | CONFIRMED |
+| All three scores rank Firm C as most hand-leaning. | v2:93 | Script 38 per-firm summary ranks Firm C highest on mean P_C1 and mean hand_frac, but Firm D highest on mean reverse-anchor. | FLAGGED |
+| Per-signature kappas are 0.662, 0.559, and 0.870; verdict moderate. | v2:95-103, v2:170 | Script 39 reports 0.6616, 0.5586, 0.8701 and `SIG_CONVERGENCE_MODERATE`. | CONFIRMED |
+| Pixel-identical subset is n=262 split 145 / 8 / 107 / 2, with 0% miss rate and Wilson upper 1.45%. | v2:111-119, v2:172-173 | Script 40 reports total 262, the per-firm split, and 262/262 correct for all three candidate classifiers with Wilson [0.00%, 1.45%]. | CONFIRMED |
+| Non-Firm-A dip values are 0.998/0.906 for `big4_non_A` and 0.998/0.907 for `all_non_A`. | v2:21, v2:43, v2:161-162 | Script 32 reports 0.9985/0.9055 and 0.9975/0.9065, matching v2 rounded values. | CONFIRMED |
+
+## Outstanding open questions
+
+1. **Five-way moderate-confidence validation still needs a decision.** v2 is honest that the v4 kappa evidence covers only the high-confidence binary rule (v2:103, v2:190-192). If the five-way classifier remains primary, the cleanest next step is a Big-4-specific capture/FAR/cross-tab analysis for the moderate band and the document-level worst-case aggregation. If not rerun, the manuscript should explicitly state that the moderate band remains inherited from v3.x and is not newly validated by Scripts 38-40.
+
+2. **Firm anonymisation policy still needs confirmation for §IV-V.** v2 itself is pseudonymous, but the open question at v2:194 remains real: once §IV-V discuss within-Big-4 contrasts, the manuscript should consistently use Firm A-D and keep any real-name mapping out of the paper body.
+
+3. **Section IV numbering can remain deferred.** v2:196 is procedural and does not block §III acceptance; resolve after the methodology claims and result-table sequence are frozen.
+
+## Recommended next-step actions
+
+1. Correct v2:93's per-firm ordering claim against Script 38.
+
+2. Decide whether to add a Big-4-specific validation for the five-way moderate band and document-level aggregation. If not, narrow v2:125 so binary-rule correlations do not appear to validate the full five-way classifier.
+
+3. Narrow the dip-test scope language at v2:21 and v2:43, or add missing individual-firm dip tests for Firms B-D.
+
+4. Fix v2:143 so Spearman correlations are tied to K=3 posterior scores, not K=3 hard labels.
+
+5. Correct the provenance table entry for the 150,442 signature count to cite Script 39 as the direct markdown-report source.
+
+6. Replace "max fold-to-fold deviation" with the exact Script 36 statistic or report the actual pairwise fold range.
+
+7. Remove the author checklist and open-question block from the manuscript version after these decisions are resolved.