Revise §III v4.0 draft per codex round-21 review (Major Revision -> v2)

Codex gpt-5.5 xhigh review of v1 draft returned Major Revision with 5 Major findings + 7 Minor + editorial nits. v2 addresses all of them.

Key v2 changes:

1. Primary classifier declared: inherited v3.x five-way per-signature box rule. K=3 mixture is demoted to accountant-level descriptive characterisation (Script 35 / Script 38 footing), explicitly NOT used to assign signature- or document-level labels.
2. §III-J reframed as "Mixture Model and Accountant-Level Characterisation" (was "Mixture Model and Operational Threshold Derivation"). K=3 LOOO P2_PARTIAL verdict surfaced in prose including the "not predictively useful as an operational classifier" interpretation from the Script 37 verdict legend.
3. §III-K renamed "Convergent Internal-Consistency Checks" (was "Convergent Validation") with explicit caveat that the three scores share underlying features and are not statistically independent measurements.
4. §III-H reverse-anchor paragraph rewritten: the directional error in v1 (the non-Big-4 reference described as a "more-replicated-population baseline") is corrected -- the reference is in fact in the LESS-replicated regime relative to Big-4, and the score measures deviation in the hand-leaning direction.
5. Pixel-identity metric renamed from "FAR" to "positive-anchor miss rate" with explicit conservative-subset caveat ("near-tautological for the box rule because byte-identical => cosine ~1 / dHash ~0").
6. §III-L title changed to "Signature- and Document-Level Classification" (was "Per-Document Classification") and reorganised so the per-signature five-way rule + document-level worst-case aggregation are both clearly under this section.
7. Empirical slips corrected:
   - K=2 LOOO comparison: now correctly says "5.6x the stability tolerance 0.005" rather than "5.6x the bootstrap CI half-width"; full-Big-4 bootstrap half-width 0.0015 cited separately.
   - all-non-Firm-A dip: now correctly (0.998, 0.907), not "p > 0.99".
   - BD/McCrary: now narrowed to Big-4 scope (Script 34 null), with Script 32 dHash transitions for non-Big-4 subsets noted but not used as operational thresholds.
   - Firm A byte-identical "50 partners of 180 registered, 35 cross-year" -- now explicitly inherited from v3.x §IV-F.1 / Script 28 / Appendix B; provenance row in the new table flags this as inherited, not v4-regenerated.
   - "mid/small-firm tail actively pulling" -> "the full-sample and Big-4-only calibrations differ" (causal language softened).
   - $\Delta\text{BIC}$ sign convention: explicit "lower BIC is preferred; BIC(K=3) - BIC(K=2) = -3.48".
8. Editorial nits applied:
   - "failure rate" -> "box-rule hand-leaning rate"
   - "boundary moves modestly" -> "membership remains composition-sensitive"
   - "calibration uncertainty band +/- 5-13 pp" -> "observed absolute differences of 1.8-12.8 pp, with Firm C exceeding the 5 pp viability bar"
   - "strongest single methodology-validation signal" -> "strongest internal-consistency signal"
   - "the same component structure recovers" -> "a broadly similar three-component ordering recovers"
   - Cross-reference index marked as author checklist (remove before submission).
9. New provenance table at end of §III mapping every numerical claim to (script, source, direct/derived/inherited).
10. Open questions reduced from 5 to 3 (codex resolved questions 2, 3, 4 with concrete answers); remaining 3 are forward-looking (5-way moderate band, pseudonym consistency, table numbering).

Also commits: paper/codex_review_gpt55_v4_round1.md (codex review artifact, 143 lines).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:49:59 +08:00
parent d0bf2fe911
commit 62a22ceb83
2 changed files with 256 additions and 94 deletions
@@ -0,0 +1,143 @@
+# Paper A v4.0 Methodology Section III-G through III-L Peer Review
+
+Reviewer: gpt-5.5 xhigh  
+Date: 2026-05-12  
+Round number: 21 (v4 round 1)  
+Review target: `paper/v4/paper_a_methodology_v4_section_iii.md`
+
+Audit aliases used below:
+
+- V4: `paper/v4/paper_a_methodology_v4_section_iii.md`
+- V3: `paper/paper_a_methodology_v3.md`
+- Script36: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/calibration_and_loo_validation/calibration_loo_report.md`
+- Script37: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/k3_loo_check/k3_loo_report.md`
+- Script38: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/convergence_k3_reverse_anchor/convergence_report.md`
+- Script39: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/signature_level_convergence/sig_level_report.md`
+- Script40: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/pixel_identity_far/far_report.md`
+- Script34 local: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/big4_only_pooled/big4_only_pooled_report.md`
+- Script35 local: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/big4_k3_cluster_inspection/inspection_report.md`
+- Script32 local: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/non_firm_a_calibration/non_firm_a_calibration_report.md`
+
+## Verdict
+
+Major Revision.
+
+## Major Findings
+
+1. **K=3 is not yet justified as an operational classifier.**
+
+   V4 selects K=3 for the operational per-CPA classifier (V4:57, V4:67) and says the K=3/K=2 contrast justifies selecting K=3 (V4:107). The underlying Script37 verdict is weaker: `P2_PARTIAL`, with the explicit interpretation that the C1 cluster exists but "membership is not well-predicted by held-out fit" (Script37:92, Script37:94). The report's own legend says `P2_PARTIAL` means the cluster is "not predictively useful as an operational classifier" (Script37:97-99).
+
+   The numbers support this concern. The K=3 C1 component shape is stable (max deviations 0.0047 cosine, 0.955 dHash, 0.023 weight; Script37:77-79), but held-out C1 membership differs from baseline by up to 12.77 percentage points (Script37:83-90). For PwC, baseline C1 membership is 23.5% but the held-out prediction is 36.27% (Script37:47-51, Script37:87), a 12.77-point gap. That is not a small operational error if the label is used to classify CPAs.
+
+   The BIC evidence is also weak. K=3 has a lower BIC than K=2 by only 3.48 points (Script36:9-10; Script34 local:40-41). This is acceptable as mild descriptive support, not as the load-bearing reason to replace a classifier. The draft should either (a) demote K=3 to a descriptive/convergent-validation model, or (b) make K=3 primary only with explicit LOOO membership uncertainty and soft-posterior reporting.
+
+2. **The "three independent lenses" framing overstates independence and validation strength.**
+
+   V4 describes the convergent validation as three "independent statistical lenses" (V4:73-89). They are not independent empirical measurements. All three are deterministic functions of the same per-CPA or per-signature `(cos, dHash)` features:
+
+   - Lens 1 is K=3 posterior from the same two descriptors (V4:77; Script38:6-12).
+   - Lens 2 is a monotone transform of the cosine marginal only (V4:78; Script38:16-18).
+   - Lens 3 is the fraction of signatures failing the same box rule `cos > 0.95 AND dh <= 5` (V4:79; Script38:20-22).
+
+   The high Spearman correlations are verified (0.9627, 0.8890, 0.8794; Script38:24-34), but they are partly mechanical agreement among feature-derived scores. They do not validate the classifier against an independent ground truth for hand-signed signatures.
+
+   There is also a conceptual reversal in the reverse-anchor prose. V4 says the non-Big-4 reference has lower cosine and higher dHash than the Big-4 C1 center (V4:37), which is verified (reference center 0.9349/9.7670 in Script38:16-18; C1 0.9457/9.1715 in Script38:8-12). But V4 then calls this a "more-replicated-population" baseline (V4:37). Lower cosine and higher dHash indicate less replication / more hand-leaning, not more replication. A reviewer will likely catch this immediately.
+
+3. **The draft conflates at least three classifiers and then validates only one simplified binary rule.**
+
+   V4 alternates among (i) K=3 per-CPA hard labels (V4:67), (ii) a binary Paper A box rule `cos > 0.95 AND dh <= 5` (V4:69), and (iii) the inherited five-way per-signature/document rule with `dh <= 5`, `5 < dh <= 15`, and `dh > 15` bands (V4:123-135). The Script38/39 convergence results validate only the simplified binary rule `non_hand iff cos > 0.95 AND dh <= 5` (Script38:20-22; Script39:8-12). They do not validate the full five-way classifier, especially the moderate non-hand-signed band `5 < dh <= 15`.
+
+   This matters because V3's inherited Section III-K explicitly treated `cos > 0.95 AND 5 < dh <= 15` as "Moderate-confidence non-hand-signed" (V3:278-287). V4 keeps that category (V4:127) but cites kappa/rho evidence from a binary high-confidence-only rule (V4:121). The current prose therefore overstates what the Script39 kappa values prove.
+
+   Recommended fix: choose a primary endpoint. If the five-way rule remains primary, validate that exact five-way rule or its declared binary collapse. If K=3 becomes primary, provide a document-level aggregation rule for K=3 and stop calling the inherited box rule the operational classifier.
+
+4. **The pixel-identity validation is useful, but "FAR" is the wrong metric name and the evidentiary force is overstated.**
+
+   Script40's ground truth is a positive class: pixel-identical signatures are treated as replicated (Script40:4-8). Misclassifying them as hand-leaning is a false negative / miss rate on an easy positive-anchor subset, not a false-alarm rate in the usual classifier sense. V4 defines FAR as "probability of labelling a pixel-identical signature as hand-leaning" (V4:109), which reverses standard terminology.
+
+   The 0/262 result is verified for all three classifiers (Script40:12-18), and the caveat that pixel-identity is necessary but not sufficient is appropriate (V4:117; Script40:29-31). But for the Paper A box rule this result is close to tautological: byte-identical nearest-neighbor signatures will have near-maximal cosine and minimal dHash. V3 was more careful, noting that FRR against byte-identical positives is trivially zero at thresholds below 1 and should be interpreted qualitatively (V3:266-268).
+
+   Rename this metric to "pixel-identity positive-anchor miss rate" or "false-hand rate on replicated positives." Do not present it as FAR unless a true hand-signed negative anchor is evaluated.
+
+5. **Several empirical/provenance claims need correction or explicit "unverified" status.**
+
+   - V4 says the K=2 LOOO max cosine deviation 0.028 is `5.6x` a "bootstrap CI half-width of 0.005" (V4:103). Script36 reports max deviation 0.0278 (Script36:43), but 0.005 is the stability tolerance in the verdict legend, not the bootstrap CI half-width (Script36:50-52). The full Big-4 bootstrap cosine CI half-width is 0.0015 (Script36:14-17). Correct the denominator and wording.
+
+   - V4 says all-non-Firm-A is dip-test unimodal at `p > 0.99` (V4:21). Script32 local reports all-non-Firm-A cosine p = 0.9975 but dHash p = 0.9065 (Script32 local:56-76). The later detailed sentence in V4 correctly gives 0.998/0.907 (V4:43). Fix the earlier overstatement.
+
+   - V4 says no BD/McCrary transition is identified on either axis and cites Script32/34 (V4:47). Script34 local supports no Big-4-only BD/McCrary threshold (Script34 local:28-31), but Script32 local reports dHash BD/McCrary thresholds for `big4_non_A` and `all_non_A` (Script32 local:36-44, Script32 local:68-76). Narrow the claim to the Big-4-only analysis or explain why Script32 subset transitions are not used.
+
+   - The Firm A byte-identical claim is partly verified. Script40 verifies 145 Firm A pixel-identical signatures inside the 262 Big-4 total (Script40:20-27). The added details "50 distinct Firm A partners," "of 180 registered," and "35 span different fiscal years" appear in V3 (V3:165) and V4 (V4:31), but I did not find them in the supplied Script36-40 reports. Treat those details as unverified unless the Appendix B/script artifact is cited directly.
+
+   - The "mid/small-firm tail actively pulling the v3.x crossing" statement (V4:19) is stronger than the local Script34 evidence. Script34 local verifies the Big-4-only crossing and CI (Script34 local:18-24), and it reports a large offset from the published baseline (Script34 local:51-58). It does not, by itself, prove the causal language "actively pulling" rather than "the full-sample and Big-4-only calibrations differ."
+
+## Minor Findings
+
+1. **Dip-test p-value precision needs a resolution check.** V4 says bootstrap p-value estimation uses `n_boot = 2000` and reports `p < 10^-4` (V4:43). With a finite bootstrap of 2000, the natural resolution is about 1/2000 unless the script uses a different asymptotic/calibrated p-value. Script36/34 display p = 0.0000 (Script36:6-8; Script34 local:28-31). State the reporting convention precisely, e.g., "no bootstrap replicate exceeded the observed statistic; reported as p < 0.001" if that is what happened.
+
+2. **The Delta BIC sign convention is confusing.** V4 reports "Delta BIC = -3.5" (V4:65). Since lower BIC is preferred, a reviewer may expect `BIC(K=2) - BIC(K=3) = 3.48` or "K=3 lower by 3.48." Use one convention and define it.
+
+3. **Per-signature convergence is real but only moderate for the box rule.** Script39 verifies kappas of 0.6616, 0.5586, and 0.8701 (Script39:22-30). The report verdict is `SIG_CONVERGENCE_MODERATE`, not strong (Script39:41-48). V4's statement that box-rule disagreement reflects "different decision geometries" rather than signal disagreement (V4:99) is plausible but interpretive. Add the moderate verdict and avoid making geometry the only explanation.
+
+4. **Per-CPA vs per-signature component centers drift more than the prose suggests.** Script39 shows per-CPA C1 at cosine 0.9457 and per-signature C1 at 0.9280 (Script39:16-20). Kappa is high for K=3 perCPA vs perSig labels (Script39:28), but "the same component structure recovers" (V4:99) should be softened to "a broadly similar three-component ordering recovers."
+
+5. **The Section III-L title is misleading.** The section is titled "Per-Document Classification" (V4:119) but most of it defines per-signature categories (V4:121-133). The document-level aggregation appears only in one paragraph (V4:135). Either rename to "Signature- and Document-Level Classification" or split the two parts.
+
+6. **K=3 alternative output lacks document aggregation.** V4 says the K=3 alternative assigns each signature to C1/C2/C3 (V4:137), but if Section III-L is per-document classification, the K=3 alternative also needs a document-level worst-case or posterior aggregation rule.
+
+7. **Firm anonymization is inconsistent.** V4 names the four firms in Chinese and then says they are pseudonymized as Firms A-D (V4:17). Later it uses PwC directly (V4:31). V3 says firm-level results are reported under pseudonyms (V3:315-316). Decide whether v4 abandons anonymization; otherwise keep the main text pseudonymous and put the mapping outside the manuscript, if at all.
+
+## Editorial / Prose Nits
+
+1. Replace "more-replicated-population baseline" (V4:37) with "less-replicated external reference" or "hand-leaning external reference."
+
+2. Replace "failure rate" for Lens 3 (V4:79, V4:89) with "box-rule hand-leaning rate" or "non-replicated rate." "Failure" sounds like classifier failure rather than a hand-leaning outcome.
+
+3. "Strongest single methodology-validation signal" (V4:89) is too strong because the lenses share features. Use "strongest internal consistency signal."
+
+4. "Boundary moves modestly" (V4:105) understates the PwC fold, where C1 membership rises from 23.5% to 36.3% (Script37:47-51). Use "membership remains composition-sensitive."
+
+5. "Calibration uncertainty band of +/- 5-13 percentage points" (V4:105) should be "observed absolute differences of 1.8-12.8 percentage points, with the largest fold exceeding the report's 5 pp viability bar" (Script37:83-90).
+
+6. "Operational threshold derivation" (V4:51) is not accurate if the operational per-signature classifier remains the inherited box rule. Use "mixture model and component assignment" unless K=3 is truly primary.
+
+7. The cross-reference index is useful, but it should be removed from the submitted manuscript or converted into an internal author checklist.
+
+## Responses to the Five Open Questions
+
+1. **Scope justification.**
+
+   The three-point argument is directionally good but not yet sufficient. Add a fourth point explicitly restricting generalizability: primary claims are for the Big-4 audit-report context, while the 249 non-Big-4 CPAs are used only as robustness/reverse-anchor context unless Section IV-K independently validates them. Also soften "tail distorts" to "tail changes the fitted crossing" unless you cite a direct diagnostic for distortion. The Big-4 counts and crossings are verified (Script34 local:4-24; Script36:6-17), but the causal language needs restraint.
+
+2. **Firm A phrasing.**
+
+   Use "templated-end case study" or "replication-heavy descriptive reference." Do not use "calibration reference, descriptively defined post-hoc" unless Firm A actually calibrates a threshold in v4. The draft correctly says Firm A is not the calibration anchor (V4:33). Calling it a calibration reference reintroduces the v3 vulnerability.
+
+3. **K=3 vs K=2 rationale.**
+
+   As written, no. Selecting K=3 as an operational classifier on LOOO stability is not acceptable because Script37 says K=3 is only `P2_PARTIAL` and "not predictively useful as an operational classifier" (Script37:92-99). Do not strengthen the BIC argument; a Delta BIC of about 3.5 is mild evidence. The defensible claim is: K=2 is clearly unstable; K=3 gives a reproducible hand-leaning component shape; hard membership remains uncertain and should be reported as calibration uncertainty.
+
+4. **Hybrid box rule plus K=3 alternative.**
+
+   The hybrid can be acceptable only if roles are sharply separated: the inherited five-way box rule is the primary signature/document classifier; K=3 is an accountant-level characterization and exploratory alternative. The current draft blurs this by calling K=3 "operational" (V4:67) while keeping the box rule in Section III-L (V4:121-137). Also, the validation scripts use the binary high-confidence rule `dh <= 5`, not the full five-way rule with `dh <= 15`. Fix this before deciding whether to keep the hybrid.
+
+5. **Section IV numbering.**
+
+   Do not freeze table numbers yet. First settle the Methodology labels and primary classifier. Results should mirror this order: sample/scope, K=2/K=3 calibration, convergence lenses, K=2 and K=3 LOOO, pixel-identity positive-anchor check, signature/document classification outputs, then full-dataset robustness. After that, assign table numbers and verify every Section III cross-reference to Section IV-D/F/G/K.
+
+## Recommended Next-Step Actions
+
+1. Rewrite Sections III-J and III-K so K=3 is either clearly primary with uncertainty, or clearly descriptive. If descriptive, remove "operational threshold" language from the K=3 discussion.
+
+2. Add the Script37 `P2_PARTIAL` result directly to the prose. Do not hide the "not predictively useful as an operational classifier" implication.
+
+3. Decide and declare the primary classifier: inherited five-way box rule, binary high-confidence box rule, or K=3 hard/posterior labels. Align all validation text to that exact classifier.
+
+4. If the five-way rule remains primary, rerun or report validation for the five-way categories and the document-level worst-case aggregation, not just `cos > 0.95 AND dh <= 5`.
+
+5. Rename the pixel-identity metric from FAR to positive-anchor miss rate / false-hand rate. Add a separate specificity/FAR result only if a true hand-signed or inter-CPA negative anchor is evaluated.
+
+6. Correct the empirical slips: K=2 "0.005 bootstrap half-width," all-non-Firm-A `p > 0.99`, Script32 BD/McCrary wording, reverse-anchor "more-replicated" phrase, and any unverified Firm A byte-decomposition details.
+
+7. Add a short provenance table for every numerical claim in Sections III-G through III-L, including exact report path, script number, and whether the number is directly reported or inferred by arithmetic.
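
The derived figures cited in Major Findings 1 and 5 and in Minor Findings 1 and 2 of the review above can be re-checked with a few lines of arithmetic. The sketch below is illustrative only: it copies its inputs from the review text rather than reading the Script36 or Script37 reports, and every variable name is hypothetical.

```python
# Illustrative re-check of derived figures quoted in the review.
# All inputs are copied from the review text; variable names are hypothetical
# and nothing here reads the underlying Script36/Script37 reports.

# Major Finding 5: K=2 LOOO max cosine deviation against two denominators.
k2_looo_max_cos_dev = 0.0278   # Script36:43
stability_tolerance = 0.005    # Script36:50-52 (verdict legend)
bootstrap_half_width = 0.0015  # Script36:14-17 (full Big-4 bootstrap)

ratio_to_tolerance = k2_looo_max_cos_dev / stability_tolerance
ratio_to_half_width = k2_looo_max_cos_dev / bootstrap_half_width

# Major Finding 1: PwC baseline vs. held-out C1 membership, in percentage
# points (Script37:47-51, Script37:87).
pwc_baseline_c1_pct = 23.5
pwc_held_out_c1_pct = 36.27
viability_bar_pp = 5.0         # the report's 5 pp viability bar
pwc_abs_diff_pp = abs(pwc_held_out_c1_pct - pwc_baseline_c1_pct)
exceeds_viability_bar = pwc_abs_diff_pp > viability_bar_pp

# Minor Finding 1: smallest p-value step a finite bootstrap can resolve.
n_boot = 2000
bootstrap_p_resolution = 1 / n_boot

# Minor Finding 2: the same BIC gap written under both sign conventions.
bic_k3_minus_k2 = -3.48        # lower BIC is preferred
bic_k2_minus_k3 = -bic_k3_minus_k2

print(f"deviation / stability tolerance:  {ratio_to_tolerance:.2f}x")
print(f"deviation / bootstrap half-width: {ratio_to_half_width:.2f}x")
print(f"PwC C1 membership gap: {pwc_abs_diff_pp:.2f} pp "
      f"(exceeds 5 pp bar: {exceeds_viability_bar})")
print(f"bootstrap p-value resolution at n_boot={n_boot}: {bootstrap_p_resolution}")
print(f"BIC(K=2) - BIC(K=3) = {bic_k2_minus_k3:.2f}")
```

Under these inputs the ratio to the stability tolerance rounds to 5.6, matching the corrected wording, while the ratio to the bootstrap half-width is roughly 18.5. That gap is why the two denominators should not be used interchangeably.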