- Script 13: Firm A normality/multimodality analysis (Shapiro-Wilk, Anderson-Darling, KDE, per-accountant ANOVA, Beta/Gamma fitting)
- Script 14: Independent min-dHash computation across all pairs per accountant (not just cosine-nearest pair)
- THRESHOLD_VALIDATION_OPTIONS: 2026-01 discussion doc on threshold validation approaches
- .gitignore: exclude model weights, node artifacts, and xlsx data
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Paper draft includes all sections (Abstract through Conclusion), 36 references,
and supporting scripts. Key methodology: Cosine similarity + dHash dual-method
verification with thresholds calibrated against known-replication firm (Firm A).
Includes:
- 8 section markdown files (paper_a_*.md)
- Ablation study script (ResNet-50 vs VGG-16 vs EfficientNet-B0)
- Recalibrated classification script (84,386 PDFs, 5-tier system)
- Figure generation and Word export scripts
- Citation renumbering script ([1]-[36])
- Signature analysis pipeline (12 steps)
- YOLO extraction scripts
Three rounds of AI review completed (GPT-5.4, Claude Opus 4.6, Gemini 3 Pro).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Created PaddleOCR client for remote server communication
- Implemented text masking + region detection pipeline
- Test results: 100% recall on sample PDF (found both signatures)
- Identified issues: split regions, printed text not fully masked
- Documented 5 solution options in PADDLEOCR_STATUS.md
- Next: Implement region merging and two-stage cleaning