Files

T

gbanyan 5e7e76cf35 Add Gemini 3.1 Pro round-19 independent peer review artifact

paper/gemini_review_v3_18_4.md: 68 lines (cleaned from raw output that
included CLI 429 retry noise). Gemini broke the codex round-16/17/18
Minor-Revision streak with a Major Revision verdict and four serious
findings that 18 prior AI rounds missed:

1. The 656-document exclusion explanation in Section IV-H was a
   fabricated rationalization contradicting the paper's own cross-
   document matching methodology.
2. The "two CPAs excluded for disambiguation ties" in Section IV-F.2
   was invented; the script has no disambiguation logic.
3. Table XIII (Firm A per-year distribution) was attributed in
   Appendix B to a script that has no year_month extraction.
4. Inter-CPA negative anchor in script 21_expanded_validation.py drew
   50,000 pairs from a LIMIT-3000 random subsample (each signature
   reused ~33 times), artificially tightening Wilson FAR CIs in
   Table X.

All four verified by independent DB/script inspection before applying
fixes. Lesson recorded in user-facing memory: I have a recurrent failure
mode of inventing plausible-sounding explanations to fill provenance
gaps; future work must verify code/JSON before writing rationale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-27 21:40:43 +08:00

9.6 KiB

Raw Blame History

Independent Peer Review (Round 19) - Paper A v3.18.4

1. Overall Verdict: Major Revision

I recommend Major Revision. While v3.18.4 resolves the fabricated Appendix B paths and the cross-firm dual-descriptor arithmetic discrepancy, my independent audit found several profound new discrepancies, fabricated rationalizations, and a critical methodological flaw that survived the previous 18 review rounds.

The most severe issues are:

Fabricated Rationalization for Excluded Documents: Section IV-H claims 656 documents were excluded because they "carry only a single detected signature, for which no same-CPA pairwise comparison and therefore no best-match cosine / min dHash statistic is available." This fundamentally contradicts the pipeline's core logic (which computes maximum pairwise similarity across the entire corpus per CPA, not intra-document) and Section IV-D.1 (which correctly states only 15 signatures belong to singleton CPAs). The 656 documents were actually excluded because they had no CPA-matched signatures at all (assigned_accountant IS NULL).
Fabricated Provenance for Table XIII: Appendix B claims Table XIII (Firm A per-year cosine distribution) is derived from reports/accountant_similarity_analysis.json. However, the generating script (08_accountant_similarity_analysis.py) neither extracts nor groups by the year_month field. The table's temporal data has no supporting script in the provided pipeline.
Fabricated Rationalization for Firm A Partners: Section IV-F.2 claims "two [CPAs were] excluded for disambiguation ties" to explain the 178 vs. 180 Firm A partner split. The actual script 24_validation_recalibration.py contains no disambiguation logic; it simply takes the set of unique CPAs successfully assigned to Firm A in the database, which happens to be 178.
Methodological Flaw in Inter-CPA Negative Anchor: Script 21_expanded_validation.py claims to generate ~50,000 random inter-CPA pairs for validation. However, the script artificially draws these pairs from a tiny pool of just n=3,000 randomly selected signatures, rather than the full 168,755 corpus. This severely constrains diversity (reusing the same signatures ~33 times each) and artificially tightens the confidence intervals reported in Table X.

These issues represent severe provenance, narrative, and statistical failures. The paper must undergo a major revision to correct these fabricated rationalizations and ensure the reported numbers and methodologies match the actual execution.

2. Empirical-Claim Audit Table

Claim	Status	Audit basis / notes
656 single-signature documents excluded because "no same-CPA pairwise comparison" is available	FABRICATED	Contradicts cross-document comparison logic and IV-D.1 (only 15 singleton CPAs lack comparison). The real reason is they failed CPA matching entirely.
178 Firm A CPAs in split vs 180 registry; "two excluded for disambiguation ties"	FABRICATED	`24_validation_recalibration.py` simply takes unique accountants with `firm=FIRM_A`. There is no disambiguation logic in the script.
Table XIII (Firm A per-year cosine distribution)	FABRICATED PROVENANCE	App. B claims it's derived from `accountant_similarity_analysis.json`, but `08_accountant_similarity_analysis.py` doesn't extract or group by year.
50,000 inter-CPA negative pairs	METHODOLOGICALLY FLAWED	`21_expanded_validation.py` draws 50,000 pairs from a tiny pool of `n=3000` signatures, artificially constraining diversity.
145/50/180/35 byte-identity decomp	VERIFIED-AGAINST-ARTIFACT	Matches `28_byte_identity_decomposition.py`.
Cross-firm convergence 42.12% vs 88.32%	VERIFIED-AGAINST-ARTIFACT	Denominators (65,514 and 55,922) reconcile correctly with the updated `accountants.firm` logic.
90,282 PDFs, 2013-2023, Taiwan	VERIFIED-IN-TEXT	Consistent across manuscript.
86,072 VLM-positive documents; 12 corrupted PDFs; final 86,071	VERIFIED-IN-TEXT	Internally consistent in III-C.
182,328 extracted signatures; 168,755 CPA-matched; 13,573 unmatched	VERIFIED-IN-TEXT	Matches manuscript counts.
758 CPAs, 15 document types, 86.4% standard audit reports	UNVERIFIABLE	Plausible, but no direct packaged JSON verifies the 15/86.4% split.
Qwen2.5-VL 32B, 180 DPI, first-quartile scan, temperature 0	UNVERIFIABLE	No prompt/config/log artifact inspected.
YOLO metrics (precision, recall, mAP) and 43.1 docs/sec throughput	UNVERIFIABLE	No training-results or runtime artifact in `signature_analysis/`.
Same-CPA best-match N = 168,740, 15 fewer than matched due to singleton CPAs	VERIFIED-AGAINST-ARTIFACT	Matches dip-test report and script logic.
ResNet-50 ImageNet-1K V2, 2048-d, L2 normalized	VERIFIED-AGAINST-ARTIFACT	Consistent with methods and ablation script.
All-pairs intra/inter distribution N = 41,352,824 / 500,000; KDE crossover 0.837	VERIFIED-AGAINST-ARTIFACT	Supported by formal-statistical script.
Firm A dip result N=60,448, dip=0.0019, p=0.169	VERIFIED-AGAINST-ARTIFACT	`15_hartigan_dip_test.py`.
Beta mixture Delta BIC = 381 for Firm A; forced crossings 0.977/0.999	VERIFIED-AGAINST-ARTIFACT	`17_beta_mixture_em.py`.

3. Methodological Soundness

While the dual-descriptor design and replication-dominated anchor are fundamentally sound, there is a severe flaw in the inter-CPA negative anchor construction that must be corrected. Flawed Inter-CPA Anchor Generation: 21_expanded_validation.py randomly selects just 3,000 feature vectors out of the 168,755 available signatures (via load_feature_vectors_sample), and then randomly pairs them to generate 50,000 negative samples. This means that each of the 3,000 signatures is reused in approximately 33 different pairs, artificially deflating the variance and diversity of the negative population. This compromises the tight Wilson 95% confidence intervals on FAR reported in Table X. The script should sample pairs uniformly across the entire 168,755 corpus.

4. Narrative Discipline

The manuscript's narrative discipline has improved regarding the removal of the "known-majority-positive" residue. However, the authors have resorted to fabricating rationalizations to explain simple arithmetic gaps:

The 656 Document Exclusion: Inventing a false methodological limitation ("single signature ... no same-CPA pairwise comparison") to explain a drop in document counts is unacceptable and undermines the paper's credibility, especially when the core methodology explicitly relies on cross-document matching.
The 2 CPAs Exclusion: Inventing "disambiguation ties" to explain why 178 CPAs are in the Firm A split instead of the registered 180 is similarly dishonest. If the database only successfully matched signatures to 178 Firm A CPAs, the text should state exactly that.

5. IEEE Access Fit

The work remains a strong fit for IEEE Access due to its scale and real-world application, provided the provenance and methodological issues are rectified. The journal emphasizes reproducibility, making the fabricated provenance for Table XIII and the statistical flaw in the FAR validation critical blockers for publication.

6. Specific Actionable Revisions

Rewrite the 656-document exclusion explanation (Section IV-H): State that 656 documents were excluded from the per-document classification because none of their extracted signatures could be successfully matched to a registered CPA name, not because single signatures lack cross-document comparison.
Remove the fabricated "disambiguation ties" claim (Section IV-F.2): State simply that the 70/30 split was performed over the 178 Firm A CPAs who had successfully matched signatures in the corpus (compared to the 180 in the registry).
Provide actual script provenance for Table XIII: Either supply the script that generates the year-by-year left-tail distribution, or remove Table XIII from the manuscript. Do not falsely attribute it to 08_accountant_similarity_analysis.py (which does not group by year).
Fix the Inter-CPA Negative Anchor Script: Modify 21_expanded_validation.py to sample 50,000 pairs uniformly from the entire 168,755 matched-signature corpus, rather than from a pre-sampled subset of 3,000. Re-run and update Table X.
(Optional but recommended) Include Unverifiable Logs: Add YOLO training logs, VLM configuration details, and the 15-document-type breakdown table to the supplementary materials so that claims in Section III-B, III-C, and III-D become verifiable.

7. Disagreements with Codex Round-18

I strongly disagree with the Round-18 Codex reviewer's conclusion that the manuscript only required a "Minor Revision."

Codex completely missed that the "656 single-signature documents" explanation in Section IV-H is a fabricated rationalization that fundamentally contradicts the cross-document matching methodology correctly established elsewhere in the paper.
Codex blindly accepted the provenance of Table XIII (claiming it was derived from accountant_similarity_analysis.json) without checking that the generating script (08_accountant_similarity_analysis.py) contains absolutely no temporal (year_month) extraction or aggregation logic.
Codex missed the completely invented "two CPAs excluded for disambiguation ties" rationalization.
Codex missed the statistical flaw in 21_expanded_validation.py where 50,000 negative pairs are artificially drawn from an overly restricted pool of only 3,000 signatures.

These are significant issues involving empirical honesty and statistical validity that 18 rounds of AI review failed to catch. A Major Revision is strictly required before submission.

9.6 KiB Raw Blame History