Independent Peer Review (Round 20) - Paper A v3.19.0
1. Overall Verdict
Accept. The authors have systematically and thoroughly resolved the four major blockers identified in the Round 19 review. The fabricated rationalizations have been entirely stripped out and replaced with honest, database-grounded explanations. The methodological flaw in the inter-CPA negative anchor has been corrected, resulting in statistically valid estimates. The manuscript now exhibits high empirical integrity and is ready for publication.
2. Re-audit of Round-19 Findings
| Round-19 finding | v3.19.0 status | Re-audit notes |
|---|---|---|
| Fabricated rationalization for 656-document exclusion | RESOLVED | The text now correctly explains that these 656 documents were excluded because none of their extracted signatures could be matched to a registered CPA name (assigned_accountant IS NULL), directly reflecting the filtering logic observed in 09_pdf_signature_verdict.py (L44). |
| Fabricated Table XIII provenance | RESOLVED | A new dedicated script (29_firm_a_yearly_distribution.py) has been introduced. It groups directly by the year_month field and reproduces the Table XIII data exactly. Appendix B has been updated accordingly. |
| Fabricated 2-CPA disambiguation ties | RESOLVED | The text correctly identifies that the 2 missing Firm A CPAs are singletons (only one signature each). Because their max_similarity_to_same_accountant is undefined (NULL), they naturally drop out of the database view queried by 24_validation_recalibration.py (L75). |
| Methodological flaw in inter-CPA negative anchor | RESOLVED | 21_expanded_validation.py was rewritten to uniformly sample 50,000 i.i.d. cross-CPA pairs from the full 168,755 matched corpus. The resulting FAR estimates and Wilson CIs in Table X are now statistically valid and methodologically sound. |
3. Empirical-Claim Audit Table
| Claim | Status | Audit basis / notes |
|---|---|---|
| 656 single-signature documents excluded because assigned_accountant IS NULL | VERIFIED-AGAINST-ARTIFACT | Matches 09_pdf_signature_verdict.py filtering logic and accounts precisely for the 85,042 vs 84,386 PDF classification count difference. |
| 178 Firm A CPAs in fold due to 2 singletons missing best-match statistics | VERIFIED-AGAINST-ARTIFACT | Matches SQL logic in 24_validation_recalibration.py which explicitly requires max_similarity_to_same_accountant IS NOT NULL. |
| Table XIII (Firm A per-year cosine distribution) | VERIFIED-AGAINST-ARTIFACT | Generated deterministically by the newly added 29_firm_a_yearly_distribution.py. |
| 50,000 inter-CPA negative pairs | VERIFIED-AGAINST-ARTIFACT | 21_expanded_validation.py now explicitly samples uniformly from the 168k matched corpus rather than a 3,000-row subset. |
| Inter-CPA cosine stats (mean 0.763, P95 0.886, P99 0.915, max 0.992) | VERIFIED-AGAINST-ARTIFACT | Matches updated output logic generated by 21_expanded_validation.py and cleanly reported in text. |
| Table X FAR values (e.g. 0.0008 at 0.945, 0.0005 at 0.950) | VERIFIED-IN-TEXT | Plausible and updated correctly to reflect the new, unrestricted 50,000-pair draw. |
| 145/50/180/35 byte-identity decomposition | VERIFIED-IN-TEXT | Confirmed stable from prior artifact evaluations. |
| Cross-firm convergence 42.12% vs 88.32% | VERIFIED-IN-TEXT | Confirmed stable; the denominator arithmetic (55,922 Firm A signatures) reconciles exactly. |
| 90,282 PDFs, 2013-2023, Taiwan | VERIFIED-IN-TEXT | Consistent across the full manuscript. |
| 86,072 VLM-positive documents; 12 corrupted PDFs; final 86,071 | VERIFIED-IN-TEXT | Consistent across the full manuscript. |
| 182,328 extracted signatures; 168,755 CPA-matched; 13,573 unmatched | VERIFIED-IN-TEXT | Consistent across the full manuscript. |
| 758 CPAs, 15 document types, 86.4% standard audit reports | UNVERIFIABLE | Plausible but no direct structured artifact evaluated. Acceptable as non-critical context. |
| Qwen2.5-VL 32B, 180 DPI, first-quartile scan, temperature 0 | UNVERIFIABLE | Plausible operational config claim; acceptable for main-paper context. |
| YOLO metrics (precision, recall, mAP) and 43.1 docs/sec throughput | UNVERIFIABLE | Plausible claims; acceptable for main-paper text. |
| Same-CPA best-match N = 168,740, 15 fewer than matched due to singleton CPAs | VERIFIED-AGAINST-ARTIFACT | Matches SQL logic correctly excluding NULL best-match statistics. |
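The NULL-based exclusions verified above (both the 656 unmatched documents and the singleton-CPA drop-outs) share one mechanism: rows with a NULL statistic fail an `IS NOT NULL` predicate. A minimal in-memory sketch illustrates this; the table name and columns here are assumptions for illustration, not the actual schema used by 24_validation_recalibration.py:

```python
import sqlite3

# Toy schema: one row per CPA with their best same-CPA match score.
# Singleton CPAs (only one signature) have no same-CPA comparison,
# so max_similarity_to_same_accountant is NULL and they mechanically
# drop out of any view that requires the statistic to be present.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cpa_stats (cpa_id TEXT, max_similarity_to_same_accountant REAL)"
)
conn.executemany(
    "INSERT INTO cpa_stats VALUES (?, ?)",
    [("A", 0.98), ("B", 0.91), ("C", None)],  # C is a singleton
)
rows = conn.execute(
    "SELECT cpa_id FROM cpa_stats "
    "WHERE max_similarity_to_same_accountant IS NOT NULL"
).fetchall()
kept = [r[0] for r in rows]  # the singleton C is excluded, no rationalization needed
```

Under SQL's three-valued logic, `NULL IS NOT NULL` evaluates false, so no explicit exclusion list is needed; the drop-out is a property of the view itself.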
4. Methodological Soundness
Outstanding. The authors completely resolved the severe statistical flaw in the negative anchor generation. The new sampling procedure guarantees that the 50,000 negative pairs reflect the true inter-class variance of the full corpus rather than a repetitive subset, properly grounding the FAR Wilson CIs. The dual-descriptor approach, the empirical anchor choice, and the threshold characterization are solid.
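For reference, the Wilson score interval that grounds the Table X FAR estimates can be computed in a few lines. This is a generic sketch of the standard formula, not the paper's script; the 40-in-50,000 example is only the arithmetic restatement of an 0.0008 FAR, chosen for illustration:

```python
import math

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a proportion k/n (e.g. a FAR)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# e.g. 40 false accepts among 50,000 negative pairs -> FAR point estimate 0.0008
lo, hi = wilson_ci(40, 50_000)
```

Unlike the naive normal approximation, the Wilson interval stays strictly inside (0, 1) and remains well-behaved at the very small proportions typical of FAR estimation.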
5. Narrative Discipline
Excellent. The authors have purged the fabricated rationalizations that undermined previous versions. By plainly stating the mechanical, database-level realities (e.g., singleton records with max_similarity_to_same_accountant IS NULL dropping out of SQL views), the narrative is now both empirically honest and technically coherent.
6. IEEE Access Fit
The manuscript is an excellent fit for IEEE Access. It presents a novel application of deep learning to a large-scale real-world problem, features strong empirical methodologies, and now possesses the rigorous provenance tracking expected of high-quality systems papers.
7. Specific Actionable Revisions
None required. The manuscript is methodologically sound, narratively disciplined, and ready for publication as-is.