# Paper A v4.0 Methodology Section III-G through III-L Peer Review Reviewer: gpt-5.5 xhigh Date: 2026-05-12 Round number: 21 (v4 round 1) Review target: `paper/v4/paper_a_methodology_v4_section_iii.md` Audit aliases used below: - V4: `paper/v4/paper_a_methodology_v4_section_iii.md` - V3: `paper/paper_a_methodology_v3.md` - Script36: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/calibration_and_loo_validation/calibration_loo_report.md` - Script37: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/k3_loo_check/k3_loo_report.md` - Script38: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/convergence_k3_reverse_anchor/convergence_report.md` - Script39: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/signature_level_convergence/sig_level_report.md` - Script40: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/v4_big4/pixel_identity_far/far_report.md` - Script34 local: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/big4_only_pooled/big4_only_pooled_report.md` - Script35 local: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/big4_k3_cluster_inspection/inspection_report.md` - Script32 local: `/Volumes/NV2/PDF-Processing/signature-analysis/reports/non_firm_a_calibration/non_firm_a_calibration_report.md` ## Verdict Major Revision. ## Major Findings 1. **K=3 is not yet justified as an operational classifier.** V4 selects K=3 for the operational per-CPA classifier (V4:57, V4:67) and says the K=3/K=2 contrast justifies selecting K=3 (V4:107). The underlying Script37 verdict is weaker: `P2_PARTIAL`, with the explicit interpretation that the C1 cluster exists but "membership is not well-predicted by held-out fit" (Script37:92, Script37:94). The report's own legend says `P2_PARTIAL` means the cluster is "not predictively useful as an operational classifier" (Script37:97-99). The numbers support this concern. K=3 C1 component shape is stable (max deviations 0.0047 cosine, 0.955 dHash, 0.023 weight; Script37:77-79), but held-out C1 membership differs from baseline by up to 12.77 percentage points (Script37:83-90). For PwC, baseline C1 is 23.5% but held-out prediction is 36.27% (Script37:47-51, Script37:87). That is not a small operational error if the label is used to classify CPAs. The BIC evidence is also weak. K=3 is lower BIC than K=2 by only 3.48 points (Script36:9-10; Script34 local:40-41). This is acceptable as mild descriptive support, not as the load-bearing reason to replace a classifier. The draft should either (a) demote K=3 to a descriptive/convergent-validation model, or (b) make K=3 primary only with explicit LOOO membership uncertainty and soft-posterior reporting. 2. **The "three independent lenses" framing overstates independence and validation strength.** V4 describes the convergent validation as three "independent statistical lenses" (V4:73-89). They are not independent empirical measurements. All three are deterministic functions of the same per-CPA or per-signature `(cos, dHash)` features: - Lens 1 is K=3 posterior from the same two descriptors (V4:77; Script38:6-12). - Lens 2 is a monotone transform of the cosine marginal only (V4:78; Script38:16-18). - Lens 3 is the fraction of signatures failing the same box rule `cos > 0.95 AND dh <= 5` (V4:79; Script38:20-22). The high Spearman correlations are verified (0.9627, 0.8890, 0.8794; Script38:24-34), but they are partly mechanical agreement among feature-derived scores. They do not validate the classifier against an independent ground truth for hand-signed signatures. There is also a conceptual reversal in the reverse-anchor prose. V4 says the non-Big-4 reference has lower cosine and higher dHash than the Big-4 C1 center (V4:37), which is verified (reference center 0.9349/9.7670 in Script38:16-18; C1 0.9457/9.1715 in Script38:8-12). But V4 then calls this a "more-replicated-population" baseline (V4:37). Lower cosine and higher dHash indicate less replication / more hand-leaning, not more replication. A reviewer will likely catch this immediately. 3. **The draft conflates at least three classifiers and then validates only one simplified binary rule.** V4 alternates among (i) K=3 per-CPA hard labels (V4:67), (ii) a binary Paper A box rule `cos > 0.95 AND dh <= 5` (V4:69), and (iii) the inherited five-way per-signature/document rule with `dh <= 5`, `5 < dh <= 15`, and `dh > 15` bands (V4:123-135). The Script38/39 convergence results validate only the simplified binary rule `non_hand iff cos > 0.95 AND dh <= 5` (Script38:20-22; Script39:8-12). They do not validate the full five-way classifier, especially the moderate non-hand-signed band `5 < dh <= 15`. This matters because V3's inherited Section III-K explicitly treated `cos > 0.95 AND 5 < dh <= 15` as "Moderate-confidence non-hand-signed" (V3:278-287). V4 keeps that category (V4:127) but cites kappa/rho evidence from a binary high-confidence-only rule (V4:121). The current prose therefore overstates what the Script39 kappa values prove. Recommended fix: choose a primary endpoint. If the five-way rule remains primary, validate that exact five-way rule or its declared binary collapse. If K=3 becomes primary, provide a document-level aggregation rule for K=3 and stop calling the inherited box rule the operational classifier. 4. **The pixel-identity validation is useful, but "FAR" is the wrong metric name and the evidentiary force is overstated.** Script40's ground truth is a positive class: pixel-identical signatures are treated as replicated (Script40:4-8). Misclassifying them as hand-leaning is a false negative / miss rate on an easy positive-anchor subset, not a false-alarm rate in the usual classifier sense. V4 defines FAR as "probability of labelling a pixel-identical signature as hand-leaning" (V4:109), which reverses standard terminology. The 0/262 result is verified for all three classifiers (Script40:12-18), and the caveat that pixel-identity is necessary but not sufficient is appropriate (V4:117; Script40:29-31). But for the Paper A box rule this result is close to tautological: byte-identical nearest-neighbor signatures will have near-maximal cosine and minimal dHash. V3 was more careful, noting that FRR against byte-identical positives is trivially zero at thresholds below 1 and should be interpreted qualitatively (V3:266-268). Rename this metric to "pixel-identity positive-anchor miss rate" or "false-hand rate on replicated positives." Do not present it as FAR unless a true hand-signed negative anchor is evaluated. 5. **Several empirical/provenance claims need correction or explicit "unverified" status.** - V4 says the K=2 LOOO max cosine deviation 0.028 is `5.6x` a "bootstrap CI half-width of 0.005" (V4:103). Script36 reports max deviation 0.0278 (Script36:43), but 0.005 is the stability tolerance in the verdict legend, not the bootstrap CI half-width (Script36:50-52). The full Big-4 bootstrap cosine CI half-width is 0.0015 (Script36:14-17). Correct the denominator and wording. - V4 says all-non-Firm-A is dip-test unimodal at `p > 0.99` (V4:21). Script32 local reports all-non-Firm-A cosine p = 0.9975 but dHash p = 0.9065 (Script32 local:56-76). The later detailed sentence in V4 correctly gives 0.998/0.907 (V4:43). Fix the earlier overstatement. - V4 says no BD/McCrary transition is identified on either axis and cites Script32/34 (V4:47). Script34 local supports no Big-4-only BD/McCrary threshold (Script34 local:28-31), but Script32 local reports dHash BD/McCrary thresholds for `big4_non_A` and `all_non_A` (Script32 local:36-44, Script32 local:68-76). Narrow the claim to the Big-4-only analysis or explain why Script32 subset transitions are not used. - The Firm A byte-identical claim is partly verified. Script40 verifies 145 Firm A pixel-identical signatures inside the 262 Big-4 total (Script40:20-27). The added details "50 distinct Firm A partners," "of 180 registered," and "35 span different fiscal years" appear in V3 (V3:165) and V4 (V4:31), but I did not find them in the supplied Script36-40 reports. Treat those details as unverified unless the Appendix B/script artifact is cited directly. - The "mid/small-firm tail actively pulling the v3.x crossing" statement (V4:19) is stronger than the local Script34 evidence. Script34 local verifies the Big-4-only crossing and CI (Script34 local:18-24), and it reports a large offset from the published baseline (Script34 local:51-58). It does not, by itself, prove the causal language "actively pulling" rather than "the full-sample and Big-4-only calibrations differ." ## Minor Findings 1. **Dip-test p-value precision needs a resolution check.** V4 says bootstrap p-value estimation uses `n_boot = 2000` and reports `p < 10^-4` (V4:43). With a finite bootstrap of 2000, the natural resolution is about 1/2000 unless the script uses a different asymptotic/calibrated p-value. Script36/34 display p = 0.0000 (Script36:6-8; Script34 local:28-31). State the reporting convention precisely, e.g., "no bootstrap replicate exceeded the observed statistic; reported as p < 0.001" if that is what happened. 2. **The Delta BIC sign convention is confusing.** V4 reports "Delta BIC = -3.5" (V4:65). Since lower BIC is preferred, a reviewer may expect `BIC(K=2) - BIC(K=3) = 3.48` or "K=3 lower by 3.48." Use one convention and define it. 3. **Per-signature convergence is real but only moderate for the box rule.** Script39 verifies kappas of 0.6616, 0.5586, and 0.8701 (Script39:22-30). The report verdict is `SIG_CONVERGENCE_MODERATE`, not strong (Script39:41-48). V4's statement that box-rule disagreement reflects "different decision geometries" rather than signal disagreement (V4:99) is plausible but interpretive. Add the moderate verdict and avoid making geometry the only explanation. 4. **Per-CPA vs per-signature component centers drift more than the prose suggests.** Script39 shows per-CPA C1 at cosine 0.9457 and per-signature C1 at 0.9280 (Script39:16-20). Kappa is high for K=3 perCPA vs perSig labels (Script39:28), but "the same component structure recovers" (V4:99) should be softened to "a broadly similar three-component ordering recovers." 5. **The Section III-L title is misleading.** The section is titled "Per-Document Classification" (V4:119) but most of it defines per-signature categories (V4:121-133). The document-level aggregation appears only in one paragraph (V4:135). Either rename to "Signature- and Document-Level Classification" or split the two parts. 6. **K=3 alternative output lacks document aggregation.** V4 says the K=3 alternative assigns each signature to C1/C2/C3 (V4:137), but if Section III-L is per-document classification, the K=3 alternative also needs a document-level worst-case or posterior aggregation rule. 7. **Firm anonymization is inconsistent.** V4 names the four firms in Chinese and then says they are pseudonymized as Firms A-D (V4:17). Later it uses PwC directly (V4:31). V3 says firm-level results are reported under pseudonyms (V3:315-316). Decide whether v4 abandons anonymization; otherwise keep the main text pseudonymous and put the mapping outside the manuscript, if at all. ## Editorial / Prose Nits 1. Replace "more-replicated-population baseline" (V4:37) with "less-replicated external reference" or "hand-leaning external reference." 2. Replace "failure rate" for Lens 3 (V4:79, V4:89) with "box-rule hand-leaning rate" or "non-replicated rate." "Failure" sounds like classifier failure rather than a hand-leaning outcome. 3. "Strongest single methodology-validation signal" (V4:89) is too strong because the lenses share features. Use "strongest internal consistency signal." 4. "Boundary moves modestly" (V4:105) understates the PwC fold, where C1 membership rises from 23.5% to 36.3% (Script37:47-51). Use "membership remains composition-sensitive." 5. "Calibration uncertainty band of +/- 5-13 percentage points" (V4:105) should be "observed absolute differences of 1.8-12.8 percentage points, with the largest fold exceeding the report's 5 pp viability bar" (Script37:83-90). 6. "Operational threshold derivation" (V4:51) is not accurate if the operational per-signature classifier remains the inherited box rule. Use "mixture model and component assignment" unless K=3 is truly primary. 7. The cross-reference index is useful, but it should be removed from the submitted manuscript or converted into an internal author checklist. ## Responses to the Five Open Questions 1. **Scope justification.** The three-point argument is directionally good but not yet sufficient. Add a fourth point explicitly restricting generalizability: primary claims are for the Big-4 audit-report context, while the 249 non-Big-4 CPAs are used only as robustness/reverse-anchor context unless Section IV-K independently validates them. Also soften "tail distorts" to "tail changes the fitted crossing" unless you cite a direct diagnostic for distortion. The Big-4 counts and crossings are verified (Script34 local:4-24; Script36:6-17), but the causal language needs restraint. 2. **Firm A phrasing.** Use "templated-end case study" or "replication-heavy descriptive reference." Do not use "calibration reference, descriptively defined post-hoc" unless Firm A actually calibrates a threshold in v4. The draft correctly says Firm A is not the calibration anchor (V4:33). Calling it a calibration reference reintroduces the v3 vulnerability. 3. **K=3 vs K=2 rationale.** As written, no. Selecting K=3 as an operational classifier on LOOO stability is not acceptable because Script37 says K=3 is only `P2_PARTIAL` and "not predictively useful as an operational classifier" (Script37:92-99). Do not strengthen the BIC argument; Delta BIC about 3.5 is mild. The defensible claim is: K=2 is clearly unstable; K=3 gives a reproducible hand-leaning component shape; hard membership remains uncertain and should be reported as calibration uncertainty. 4. **Hybrid box rule plus K=3 alternative.** The hybrid can be acceptable only if roles are sharply separated: inherited five-way box rule is the primary signature/document classifier; K=3 is an accountant-level characterization and exploratory alternative. The current draft blurs this by calling K=3 "operational" (V4:67) while keeping the box rule in Section III-L (V4:121-137). Also, the validation scripts use the binary high-confidence rule `dh <= 5`, not the full five-way rule with `dh <= 15`. Fix this before deciding whether to keep the hybrid. 5. **Section IV numbering.** Do not freeze table numbers yet. First settle the Methodology labels and primary classifier. Results should mirror this order: sample/scope, K=2/K=3 calibration, convergence lenses, K=2 and K=3 LOOO, pixel-identity positive-anchor check, signature/document classification outputs, then full-dataset robustness. After that, assign table numbers and verify every Section III cross-reference to Section IV-D/F/G/K. ## Recommended Next-Step Actions 1. Rewrite Sections III-J and III-K so K=3 is either clearly primary with uncertainty, or clearly descriptive. If descriptive, remove "operational threshold" language from the K=3 discussion. 2. Add the Script37 `P2_PARTIAL` result directly to the prose. Do not hide the "not predictively useful as an operational classifier" implication. 3. Decide and declare the primary classifier: inherited five-way box rule, binary high-confidence box rule, or K=3 hard/posterior labels. Align all validation text to that exact classifier. 4. If the five-way rule remains primary, rerun or report validation for the five-way categories and the document-level worst-case aggregation, not just `cos > 0.95 AND dh <= 5`. 5. Rename the pixel-identity metric from FAR to positive-anchor miss rate / false-hand rate. Add a separate specificity/FAR result only if a true hand-signed or inter-CPA negative anchor is evaluated. 6. Correct the empirical slips: K=2 "0.005 bootstrap half-width," all-non-Firm-A `p > 0.99`, Script32 BD/McCrary wording, reverse-anchor "more-replicated" phrase, and any unverified Firm A byte-decomposition details. 7. Add a short provenance table for every numerical claim in Sections III-G through III-L, including exact report path, script number, and whether the number is directly reported or inferred by arithmetic.