# Compare commits

10 Commits

| SHA1 |
|---|
| 980295d5bd |
| b33e20d479 |
| 723a3f6eaf |
| 2f05d6f0c9 |
| 4cf21a64b2 |
| d4f370bd5e |
| 6db5d635f5 |
| 918d55154a |
| 10c82fd446 |
| e36c49d2d8 |
@@ -0,0 +1,157 @@
# Paper A Round 26 Review - v4 round 6
Reviewer: gpt-5.5 xhigh
Date: 2026-05-12
Target: `paper/v4/paper_a_prose_v4_phase4.md` (Phase 4 prose draft v1)
Foundation checked: `paper/v4/paper_a_methodology_v4_section_iii.md` (§III v6) and `paper/v4/paper_a_results_v4_section_iv.md` (§IV v3.2)
Trajectory checked: rounds 21-25 plus v3.20.0 Abstract / §I / §II / §V / §VI baselines
## Verdict
Major Revision.
The technical core in §III v6 and §IV v3.2 is stable, but the new Phase 4 prose introduces several reviewer-visible regressions. The most important are: (i) the Abstract and Introduction revive the "independent scores" overclaim even though §III/§IV repeatedly say the three scores are not statistically independent; (ii) §I and §V overstate the Big-4 scope evidence by claiming unsupported single-firm and full-dataset dip-test non-rejections; (iii) §II is still a placeholder with `[add citation]`, not a submission-ready related-work section; and (iv) §V-G drops several inherited limitations from v3.20.0.
## Section-By-Section Findings
### Abstract
1. **Major - line 11: "Three independent feature-derived scores" contradicts the converged methodology.** §III-K states that the three scores are "not statistically independent measurements" because all are deterministic functions of the same descriptor means (§III:90), and §IV-F repeats the caveat (§IV:79). The Abstract should say "three feature-derived scores" or "three non-identical feature-derived summaries" and, if space allows, add the shared-feature caveat.
2. **Minor - line 11: "candidate classifiers" can be read as operational-classifier language.** One of the three "candidate classifiers" is the K=3 per-CPA hard label, which §III-J/§III-L explicitly demotes to descriptive characterisation, not operational signature/document classification (§III:64, §III:156). Use "candidate rules/scores" or explicitly reserve "operational classifier" for the inherited five-way box rule.
3. **Minor - line 11: the Abstract meets the IEEE Access format requirements but leaves no word-count margin.** It is one paragraph, and `wc -w` counts 247 words, so it satisfies the <=250-word target; any added caveat will therefore require trimming elsewhere.
4. **Minor - line 11: the Abstract does not name the primary operational output.** The abstract describes the pipeline and the K=3 / convergence / anchor checks, but it does not state that the primary operational output remains the inherited five-way per-signature classifier with worst-case document aggregation (§III-L; §IV-J). This omission makes the K=3 and reverse-anchor checks look more central operationally than §III/§IV allow.
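The word-count finding above is easy to keep mechanically verified; a minimal sketch of the check (the 247-word paragraph below is a stand-in for the actual abstract, whose path is elided in this review):

```python
# Sketch of the abstract length check; the review's shell equivalent is
# `sed -n '11p' <file> | wc -w`. The paragraph below is a 247-word
# stand-in, not the actual abstract text.
abstract = " ".join(["word"] * 247)

words = len(abstract.split())
limit = 250  # IEEE Access target cited in this review
print(f"{words} words, margin {limit - words}")  # 247 words, margin 3
```

Re-running this after every caveat insertion keeps the "no margin" complaint from silently becoming an over-limit abstract.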
### §I Introduction
1. **Major - line 31: the Big-4 scope claim is overbroad and partly unsupported.** The sentence says "neither any single firm pooled alone nor the broader full-dataset variant rejects unimodality." §III and §IV only report comparison dip tests for Firm A alone, Firms B+C+D pooled, and all non-Firm-A pooled (§III:34, §III:56; §IV:27-34). They explicitly state that single-firm dip tests for Firms B, C, and D were not separately computed (§III:34, §III:56; §IV:34). §IV-K is a light full-dataset K=3 + Spearman robustness check and does not report a full-dataset dip test (§IV:230-252). Rewrite this as "no narrower comparison scope tested in Script 32..." and remove the full-dataset dip-test claim unless a spike report is added.
2. **Major - line 29: the section cross-reference for accountant-level distributional characterisation is wrong.** The prose points to "§III-D" for the Big-4 accountant-level distributional characterisation. In the converged methodology, this material is §III-G through §III-J, especially §III-I and §III-J (§III:18-86). §IV-D/§IV-E are correct.
3. **Major - line 35: the Introduction repeats the "independent feature-derived scores" error.** The next sentence correctly says the scores are not statistically independent, but the opening clause still hands reviewers an avoidable contradiction. This was a central round-21/22 issue and should not reappear in the front matter.
4. **Minor - line 47: contribution 4 again overstates "not at narrower scopes."** The defensible phrase is "not in the narrower comparison scopes tested" because B/C/D single-firm dip tests were not computed.
5. **Minor - line 55: contribution 8 overclaims the full-dataset check.** §IV-K deliberately re-runs only K=3 + Paper A box-rule Spearman convergence at full `n = 686`; it does not re-run LOOO, five-way moderate-band validation, or operational threshold calibration (§IV:230). "Pipeline reproducibility at multiple scopes" should be narrowed to "the K=3 + box-rule rank-convergence check reproduces at the full-CPA scope."
6. **Minor - line 25: the methodological safeguards paragraph uses "external validation" too broadly.** The pixel-identity anchor is a conservative positive-subset check, the inter-CPA FAR is inherited corpus-wide, and LOOO is descriptive composition-sensitivity evidence. The paragraph should avoid implying full external validation of the operational classifier.
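The narrowed claim in finding 5 is a Spearman rank-convergence re-run with a drift figure; a minimal sketch of that computation on toy per-CPA scores (all values illustrative, not the paper's `n = 686` data; `scipy` assumed available):

```python
from scipy.stats import spearmanr

# Toy per-CPA summaries: a K=3 P(C1)-style score and a box-rule-style
# score (illustrative values only, not the paper's data).
p_c1      = [0.91, 0.85, 0.40, 0.22, 0.75, 0.10, 0.55, 0.33]
box_score = [0.92, 0.80, 0.60, 0.25, 0.35, 0.05, 0.50, 0.30]

# Rank convergence at the "full" scope vs a narrower subset, plus drift.
rho_full, _   = spearmanr(p_c1, box_score)
rho_subset, _ = spearmanr(p_c1[:6], box_score[:6])
drift = abs(rho_full - rho_subset)
print(f"full rho = {rho_full:.4f}, subset rho = {rho_subset:.4f}, "
      f"drift = {drift:.4f}")
```

A high rho with small drift is exactly what the light §IV-K check reports; it says nothing about dip tests, LOOO, or thresholds at that scope, which is why the prose must not generalise it.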
### §II Related Work
1. **Major - lines 63-65: §II is not submission-ready prose if inserted as written.** The section says v3.20.0 §II is retained "without substantive change," but the target Phase 4 file is supposed to replace the §II block. As written, it is a meta-summary rather than an actual Related Work section. Either the master manuscript must keep the full v3.20.0 §II text and splice in the LOOO paragraph, or this file must contain the full revised §II.
2. **Major - line 67: unresolved citation placeholder.** "`[add citation]`" is still present. This must be replaced before Phase 5; otherwise a reviewer can attack the only new Related Work content as uncited.
3. **Minor - line 67: "calibration uncertainty band on the operational rule" conflicts with the converged classifier framing.** §III-J says neither K=2 nor K=3 is used as an operational classifier (§III:64), and §III-L reserves operational classification for the inherited five-way box rule (§III:138-156). If the LOOO paragraph is about K=2/K=3 mixture fits, call it a composition-sensitivity or calibration-uncertainty check on the candidate mixture boundary/characterisation, not on "the operational rule."
### §V Discussion
1. **Major - line 81: the prose reifies mechanism labels at the CPA level.** "Some CPAs are templated, some are hand-leaning, some are mixed" is stronger than §III allows. §III-G says a per-CPA mean is a summary statistic, not a claim that all signatures for that CPA share a mechanism (§III:22). Use component-membership wording: "some CPAs' observed signatures place their per-CPA means in the templated, mixed, or hand-leaning regions."
2. **Major - line 81: the within-CPA unimodality explanation is speculative.** The claim that occasional template reuse "produces a unimodal per-signature distribution within the CPA but a multimodal per-CPA distribution across CPAs" is not directly tested in §III/§IV. v3.x tested Firm A and all-CPA signature-level distributions, and v4.0 adds per-signature K=3 consistency (§IV-F), but there is no per-CPA distributional test for individual CPAs.
3. **Major - lines 103-119: limitations are incomplete relative to v3.20.0 and the inherited pipeline.** The v4 limitations keep the Big-4 scope, missing hand-signed ground truth, pixel-identity subset, inherited-rule, A1, K=3 composition, and no-intent caveats. They drop v3 limitations that still apply: ImageNet-pretrained ResNet-50 without signature-domain fine-tuning (v3 §V:90-92), HSV red-stamp removal artifacts (v3 §V:93-95), longitudinal scanning/PDF/compression confounds (v3 §V:97-99), source-exemplar misattribution in max/min pair logic (v3 §V:100-102), and legal/regulatory interpretation limits (v3 §V:108-109). If these are intentionally retired, the draft needs a reason; otherwise they should be restored.
4. **Major - line 107: the scope limitation repeats the unsupported full-dataset dip-test implication.** The sentence says dip-test multimodality is "not available at narrower or broader scopes." §III/§IV do not report full-dataset dip-test results; §IV-K is explicitly a light Spearman robustness check (§IV:230-252). Keep the LOOO broader-scope caveat, but do not claim full-dataset dip-test non-availability without evidence.
5. **Minor - line 79: "v4.0 inherits and confirms" is too strong for the per-signature continuous-spectrum reading.** The exact v3 per-signature diagnostic package is inherited; v4.0's new per-signature evidence is mostly the K=3 consistency check (§IV-F) and five-way output (§IV-J). Safer: "v4.0 inherits this signature-level reading and remains consistent with it."
6. **Minor - line 85: inherited Firm A byte-level details need provenance language.** The 145 Firm A pixel-identical signatures are verified in Script 40, but the "50 distinct partners" and "35 cross-year" details are explicitly inherited from v3 / Script 28 and not regenerated in v4.0 (§III:44, §III:190). The discussion should mark that provenance, especially because the spike reports provided for v4 only verify the 145 count.
7. **Minor - line 87: Firm A does not alone anchor §IV-H.** §IV-H's positive-anchor subset is all Big-4 byte-identical signatures, `n = 262`, split 145 / 8 / 107 / 2 across Firms A-D (§IV:145-153). Firm A is the largest subset and the case-study evidence, but not the whole anchor.
8. **Minor - line 97: "published box rule" is not traceable.** §III/§IV call this the inherited Paper A / v3.x box rule, not a published external rule (§III:96, §III:138; §IV:85-87). Use "inherited box rule" unless there is a publication citation.
9. **Minor - line 97: "produce the same per-CPA ranking" is stronger than the evidence.** The scores are highly correlated, but §III/§IV note a residual non-Firm-A disagreement: reverse-anchor ranks Firm D fractionally above Firm C while P(C1) and box-rule hand-leaning rate rank Firm C highest (§III:106; §IV:102). Say "broadly concordant ranking."
10. **Minor - line 101: "candidate classifiers" again blurs operational status.** K=3 hard labels remain descriptive. This can be fixed together with the Abstract wording.
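Several findings above turn on where the "templated / mixed / hand-leaning" regions meet; for a two-component 1-D mixture that descriptive boundary is the posterior crossing. A minimal sketch with synthetic component parameters (illustrative stand-ins, not the paper's fitted values; `scipy` assumed available):

```python
from scipy.optimize import brentq
from scipy.stats import norm

# Synthetic two-component 1-D mixture over a cosine-similarity axis;
# weights/means/sigmas are illustrative, not fitted values.
w1, m1, s1 = 0.6, 0.95, 0.02   # "templated"-like component
w2, m2, s2 = 0.4, 0.88, 0.03   # "hand-leaning"-like component

def posterior_c1(x):
    """Posterior probability that x came from component 1."""
    d1 = w1 * norm.pdf(x, m1, s1)
    d2 = w2 * norm.pdf(x, m2, s2)
    return d1 / (d1 + d2)

# The descriptive boundary is where the posterior crosses 0.5,
# bracketed between the two component means.
crossing = brentq(lambda x: posterior_c1(x) - 0.5, m2, m1)
print(f"posterior crossing at cos = {crossing:.4f}")
```

The crossing describes where one component's posterior starts to dominate; calling a per-CPA mean "in the templated region" is a statement about this boundary, not about the mechanism behind every signature, which is the distinction finding 1 asks the prose to keep.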
### §VI Conclusion And Future Work
1. **Major - line 127: "cross-scope pipeline reproducibility" overstates §IV-K.** The full-dataset result verifies only that K=3 P(C1) and Paper A hand-leaning-rate Spearman convergence remains high at `n = 686` with drift `0.0069` (§IV:242-250; full-dataset report:25-31). It does not reproduce the pipeline, the five-way classifier, the moderate-confidence band, LOOO, or operational thresholds at full scope.
2. **Minor - line 129: the future-work audit-quality contrast must stay explicitly descriptive.** "Firm A's 82% templated concentration vs Firm C's 23.5% hand-leaning concentration" comes from K=3 hard-posterior accountant-level assignment (§IV:215-224), whose membership is composition-sensitive (§IV:129-139). The future-work sentence is acceptable if it says these are descriptive component concentrations and that current Paper A provides no audit-quality correlation evidence.
3. **Minor - lines 125-127: the conclusion underplays the actual operational output.** It names the pipeline and methodological checks, but it does not mention the inherited five-way per-signature/document-level classifier that §III-L and §IV-J define as the operational output. This is not a numerical error, but it leaves the operational-vs-descriptive distinction less clear at closure.
## Reviewer-Attack Vulnerabilities Specific To The Prose
1. A reviewer can quote line 11 or line 35 ("independent feature-derived scores") against §III-K/§IV-F's non-independence caveat and argue that the paper exaggerates validation strength.
2. A reviewer can attack the Big-4 scope claim because the prose says "any single firm" and "full-dataset variant" even though B/C/D single-firm dip tests and full-dataset dip tests are not reported.
3. The current §II can be rejected as incomplete because it is a placeholder, not a related-work section, and includes `[add citation]`.
4. "Published box rule" invites a citation challenge. The body only supports "inherited Paper A / v3.x box rule."
5. The discussion sometimes turns descriptive component labels into apparent mechanism claims about CPAs. This conflicts with the §III-G rule that per-CPA means are summaries, not partner-level mechanism assignments.
6. The phrase "candidate classifiers" for K=3 and reverse-anchor checks can be read as walking back the round-21 convergence that K=3 is descriptive and the five-way box rule is operational.
7. The limitations section is vulnerable because it drops inherited limitations that still apply to the pipeline: feature backbone transfer, red-stamp preprocessing, longitudinal document-generation shifts, source-exemplar misattribution, and legal interpretation limits.
8. The full-dataset robustness claim is easy to overread. §IV-K is intentionally "light scope"; calling it pipeline reproducibility or cross-scope operational reproducibility exceeds the evidence.
## Provenance Verification Table
| # | Phase 4 numerical claim | Phase line(s) | Provenance checked | Status |
|---:|---|---:|---|---|
| 1 | Abstract is <=250 words | 11 | `sed -n '11p' ... \| wc -w` returned 247 | Confirmed, but close to limit |
| 2 | 90,282 reports, 182,328 signatures, 758 CPAs | 11, 37, 125 | §IV:7 gives 90,282 PDFs; §IV:13 gives 182,328 extracted signatures; v3 §I:62 gives 758 CPAs | Confirmed with inherited full-corpus CPA source |
| 3 | Big-4 sub-corpus: 437 CPAs, 150,442 signatures | 11, 37, 125 | §III:30; §IV:9, §IV:15; five-way report:14-15 | Confirmed |
| 4 | Big-4 dip-test multimodality, `p < 5 x 10^-4` on both axes | 11, 31, 81, 127 | §III:34, §III:56, §III:171-172; §IV:27-34 | Confirmed for Big-4 |
| 5 | "Neither any single firm pooled alone nor broader full-dataset variant rejects" | 31 | §III:34/56 and §IV:34 say only Firm A alone was tested among single firms; §IV-K has no full-dataset dip test | Not verified / overclaimed |
| 6 | K=2 crossings `cos*=0.9755`, `dHash*=3.755`, cosine CI half-width 0.0015 | 31 | calibration report:16-17; §III:58, §III:166-170; §IV:60-63 | Confirmed |
| 7 | K=2 LOOO max cosine-crossing deviation `0.028`, `5.6x` tolerance, Firm A held-out 100% vs non-A 0% | 31, 91 | calibration report:34-44; §III:78, §III:120; §IV:122-127 | Confirmed, with 0.0278 rounded to 0.028 |
| 8 | K=3 components: C3 `0.983/2.41/0.321`, C2 `0.956/6.66/0.536`, C1 `0.946/9.17/0.143` | 33 | k3 LOOO report:8-10; convergence report:8-12; §III:70-76; §IV:69-75 | Confirmed after rounding |
| 9 | K=3 C1 LOOO shape drift: cos <=0.005, dHash <=0.96, weight <=0.023 | 11, 33, 93, 127 | k3 LOOO report:77-79; §III:78, §III:122; §IV:139 | Confirmed |
| 10 | K=3 held-out hard-posterior differences `1.8-12.8 pp` | 33, 93, 117 | k3 LOOO report:83-90; §III:122; §IV:134-139 | Confirmed after rounding |
| 11 | Three-score Spearman convergence `rho >= 0.879` | 11, 35, 51, 97, 127 | convergence report:28-30; §III:100-104; §IV:83-87 | Confirmed numerically; wording must not say independent |
| 12 | Per-signature K=3 consistency `Cohen kappa = 0.87` | 97 | §III:108-116; §IV:104-112 | Confirmed |
| 13 | Pixel-identity subset `n = 262`, all three checks 0% miss, Wilson upper 1.45% | 11, 35, 53, 101, 127 | pixel-identity report:8, 14-16; §III:124-132; §IV:145-153 | Confirmed |
| 14 | Firm A pixel-identical `145`, plus `50 partners` and `35 cross-year` | 85 | pixel-identity report:24 confirms 145; §III:44 and §III:190 mark 50/35 as inherited from v3 / Script 28, not regenerated in v4 spikes | Partially confirmed; provenance caveat needed |
| 15 | Inter-CPA FAR `0.0005`, Wilson `[0.0003, 0.0007]` | 53, 101 | §III:188; §IV:157-159; inherited v3.20.0 §IV-F.1 Table X | Confirmed as inherited |
| 16 | Full-dataset robustness `n = 686`, full rho `0.9558`, drift `0.007` | 11, 55, 107, 127 | full-dataset report:10-13, 25-31; §III:186; §IV:242-250 | Confirmed numerically, but interpretive scope is light |
| 17 | Firm A `82%/82.5%` templated and Firm C `23.5%` hand-leaning | 85, 129 | convergence report:43-48; §IV:217-224 | Confirmed as descriptive K=3 hard assignment |
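Row 13's Wilson upper bound is reproducible from the reported counts alone; a minimal sketch of the Wilson score interval (standard 95% formula with z = 1.96; the 0-miss / 262-anchor counts come from the table above):

```python
import math

def wilson_upper(k, n, z=1.96):
    """Upper bound of the Wilson score interval for k successes in n trials."""
    p = k / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + half) / denom

# Zero observed misses among the 262 byte-identical anchors.
print(f"{wilson_upper(0, 262):.4f}")  # 0.0145, i.e. the 1.45% in row 13
```

With k = 0 the bound reduces to z^2 / (n + z^2), which is why it stays non-trivial even when no misses are observed; this is the right way to report a 0% miss rate.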
## Cross-Reference Checks (Phase 4 <-> §III v6 / §IV v3.2)
| Linkage | Phase 4 evidence | §III / §IV evidence | Status |
|---|---:|---:|---|
| Big-4 primary scope and sample size | Lines 11, 31, 37, 107, 125 | §III:30; §IV:9, §IV:15 | Numerically tight, but scope-test wording overbroad |
| Accountant-level distributional characterisation refs | Line 29 | §III-I/J are the relevant methodology sections (§III:52-86); §IV-D/E correct (§IV:21-75) | Fail: `§III-D` is stale/wrong |
| K=2 as firm-mass separator, not operational | Lines 31, 91 | §III:78-86, §III:120; §IV:118-127 | Tight |
| K=3 descriptive only | Lines 33, 49, 93 | §III:64, §III:80-86, §III:156; §IV:75, §IV:139, §IV:224 | Tight, except "candidate classifier" wording |
| Three-score internal consistency | Lines 11, 35, 51, 97, 127 | §III:90-106; §IV:79-102 | Numerically tight; independence wording fails |
| Reverse-anchor reference as non-Big-4 | Lines 35, 97 | §III:48-50; §IV:89 | Tight |
| Pixel-identity positive anchor | Lines 35, 101 | §III:124-134; §IV:141-155 | Tight; Firm A-only anchoring phrase should be narrowed |
| Inter-CPA negative-anchor FAR | Lines 53, 101 | §III:126, §III:188; §IV:157-159 | Tight as inherited |
| Five-way classifier primary / MC band inherited | Lines 33, 113 | §III:136-156; §IV:161-224 | Mostly tight; Abstract/Conclusion should name operational output more clearly |
| Full-dataset robustness | Lines 55, 107, 127 | §IV:228-252 | Numerically tight; "pipeline reproducibility" overclaims light scope |
| Internal notes and close-out artifacts | Lines 3, 133-142 | Round-25 review kept this open; §III and §IV also retain internal notes | Not partner/Phase-5 ready |
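The per-signature K=3 consistency figure cross-referenced above is a Cohen's kappa in the provenance table; as a reference for how that chance-corrected agreement statistic behaves, a minimal self-contained sketch on toy label pairs (not the paper's data):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement under independent marginals.
    expected = sum(count_a[l] * count_b[l] for l in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy K=3-style hard labels from two runs of a per-signature check.
run1 = ["C1", "C1", "C2", "C2", "C3", "C3"]
run2 = ["C1", "C1", "C2", "C3", "C3", "C3"]
print(f"kappa = {cohen_kappa(run1, run2):.3f}")  # kappa = 0.750
```

Kappa discounts agreement expected by chance, so it is the appropriate statistic when the descriptive K=3 labels are compared across runs rather than against ground truth.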
## Phase 5 Readiness
Partial.
The §III/§IV technical foundation would likely survive cross-AI peer review, but the current Phase 4 prose would draw a Major Revision because it reintroduces known overclaims and has an incomplete §II. With the targeted prose repairs below, Phase 5 readiness should move to Yes.
## Recommended Next-Step Actions
1. Replace every "independent feature-derived scores" phrase with "three feature-derived scores" or "three feature-derived summaries," and preserve the shared-feature caveat in Abstract/§I/§V/§VI.
2. Rewrite the Big-4 scope language at lines 31, 47, 81, 107, and 127 to match §III exactly: Big-4 is the smallest scope among the comparison scopes tested; B/C/D single-firm dip tests were not computed; no full-dataset dip-test result is reported.
3. Fix stale cross-references in line 29: use §III-G/I/J/K as appropriate instead of §III-D.
4. Turn §II into a real revised Related Work section: retain the v3.20.0 subsections in the master, splice in the LOOO paragraph, and replace `[add citation]` with a specific cross-validation citation.
5. Rebuild §V-G limitations by merging the v4-specific limitations with still-valid v3 limitations: transferred ResNet-50 features, HSV stamp-removal artifacts, longitudinal scan/PDF confounds, source-exemplar misattribution, and legal/regulatory interpretation.
6. Replace "published box rule" with "inherited Paper A box rule" unless an external publication citation is added.
7. Narrow full-dataset language: say "K=3 + box-rule rank-convergence reproduces at full `n = 686`" rather than "pipeline reproducibility at multiple scopes."
8. Before Phase 5, strip the Phase 4 draft note and close-out checklist (lines 3 and 133-142), and continue the same cleanup for §III/§IV internal notes flagged in round 25.
@@ -0,0 +1,102 @@
# Paper A Round 27 Review - v4 round 7
Reviewer: gpt-5.5
Date: 2026-05-12
Target: `paper/v4/paper_a_prose_v4_phase4.md` (Phase 4 prose v2 + abstract trim)
Foundation checked: `paper/v4/paper_a_methodology_v4_section_iii.md` (§III v6) and `paper/v4/paper_a_results_v4_section_iv.md` (§IV v3.2)
Prior rubric checked: `paper/codex_review_gpt55_v4_round6.md`
## Verdict
Minor Revision.
Phase 4 prose v2 closes the substantive round-26 overclaim cycle. The major technical-prose risks around independent-score language, Big-4 scope, K=3 operational status, full-dataset overread, and restored limitations are now aligned with §III v6 / §IV v3.2.
The remaining issues are packaging / copy-edit blockers, not empirical blockers: §II still marks [42]-[44] as placeholders and the reference list has not been extended past [41]; internal draft notes and the Phase 4 close-out checklist remain; and §V-F still uses "candidate classifiers" for K=3/reverse-anchor checks.
## Round-26 finding closure table
### Major findings
| # | Round-26 finding | v2 status | Round-27 note |
|---:|---|---|---|
| M1 | Abstract said "Three independent feature-derived scores" | CLOSED | Abstract now says "Three feature-derived scores" and adds "not statistically independent" (line 11). |
| M2 | §I overclaimed Big-4 scope by implying any single firm and full-dataset dip-test non-rejection | CLOSED | §I now says "narrower comparison scopes tested" and names only Script 32 scopes (line 31). |
| M3 | §I stale cross-reference to §III-D | CLOSED | Replaced with §III-G through §III-J plus §IV-D/E (line 29). |
| M4 | §I repeated independent-score error | CLOSED | §I now states the three scores are not statistically independent and frames convergence as internal consistency (line 35). |
| M5 | §II not submission-ready if inserted as written | PARTIAL | The v4 addition is real prose, but the file still contains a meta note and depends on master-file splicing of `paper/paper_a_related_work_v3.md` (lines 63-65). |
| M6 | §II unresolved citation placeholder | OPEN | Body cites Stone/Geisser/Vehtari as [42]-[44], but line 65 says these are placeholders; `paper/paper_a_references_v3.md` stops at [41]. |
| M7 | §V reified CPA mechanism labels | CLOSED | Wording now says per-CPA means are located in descriptor-plane regions, not that all signatures share a mechanism (line 79). |
| M8 | §V speculative within-CPA unimodality explanation | CLOSED | The causal claim was removed; v2 only states joint consistency and repeats the summary-statistic caveat (line 79). |
| M9 | §V limitations incomplete vs v3.20.0 | CLOSED | Restored inherited limitations: ImageNet transfer, HSV artifacts, longitudinal confounds, source-exemplar misattribution, legal/regulatory interpretation (lines 119-127). |
| M10 | §V scope limitation implied full-dataset dip-test evidence | CLOSED | v2 explicitly says full `n = 686` dip-test marginals and LOOO were not tested (line 105). |
| M11 | §VI overclaimed "cross-scope pipeline reproducibility" | CLOSED | Conclusion now limits the claim to K=3 + box-rule rank-convergence at full `n = 686` and excludes thresholds/LOOO/five-way/pixel checks (line 135). |
### Minor findings
| # | Round-26 finding | v2 status | Round-27 note |
|---:|---|---|---|
| m1 | Abstract "candidate classifiers" blurred operational status | CLOSED | Abstract no longer uses "candidate classifiers"; it names the five-way operational output first (line 11). |
| m2 | Abstract had no word-count margin | CLOSED | `wc -w` on line 11 returns 243 words, leaving 7 words of margin. |
| m3 | Abstract omitted primary operational output | CLOSED | Abstract now states the inherited five-way per-signature classifier with worst-case document aggregation (line 11). |
| m4 | Contribution 4 overclaimed "not at narrower scopes" | CLOSED | Now "narrower comparison scopes tested" (line 47). |
| m5 | Contribution 8 overclaimed full-dataset check | CLOSED | Now says only K=3 + box-rule rank-convergence reproduces and explicitly excludes other components (line 55). |
| m6 | Safeguards paragraph used "external validation" too broadly | CLOSED | The paragraph now uses "annotation-free validation against naturally-occurring anchor populations" and does not imply full external validation (line 25). |
| m7 | §II "calibration uncertainty band on operational rule" conflicted with classifier framing | CLOSED | Rewritten as "composition-sensitivity band on the candidate mixture boundary" and not a sufficiency claim for the five-way classifier (line 65). |
| m8 | §V "inherits and confirms" too strong for signature-level spectrum | CLOSED | Now "inherits this signature-level reading and remains consistent with it," with no-new-diagnostic caveat (line 77). |
| m9 | Firm A byte-level details needed provenance language | CLOSED | v2 marks 50 partners / 35 cross-year as inherited from v3.20.0 Script 28 and not regenerated in v4 spikes (line 83). |
| m10 | Firm A alone did not anchor §IV-H | CLOSED | v2 says the Big-4 byte-identical anchor pools all four firms (line 85). |
| m11 | "Published box rule" not traceable | CLOSED | Replaced with "inherited Paper A box rule" throughout. |
| m12 | "Same per-CPA ranking" too strong | CLOSED | v2 now says "broadly concordant" and reports the Firm D/Firm C residual disagreement (line 95). |
| m13 | §V repeated "candidate classifiers" wording | PARTIAL | Line 99 still says "all three candidate classifiers" for the inherited box rule, K=3 hard label, and reverse-anchor metric. Use "candidate checks" or "candidate scores/rules." |
| m14 | Future-work audit-quality contrast needed descriptive caveat | CLOSED | Future work now says the Firm A/Firm C contrast is descriptive, not mechanism-level, and not linked to audit-quality outcomes (line 137). |
| m15 | Conclusion underplayed operational output | CLOSED | Conclusion now names the inherited five-way per-signature classifier and worst-case document aggregation (line 133). |
### Round-26 next-step actions
| # | Action | v2 status | Note |
|---:|---|---|---|
| A1 | Replace independent-score language and preserve shared-feature caveat | CLOSED | Done in Abstract, §I, §V, §VI. |
| A2 | Rewrite Big-4 scope language | CLOSED | Done; no unsupported B/C/D single-firm or full-dataset dip-test claim remains in body prose. |
| A3 | Fix stale §III-D cross-reference | CLOSED | Done at line 29. |
| A4 | Turn §II into real revised Related Work and replace `[add citation]` | PARTIAL | The LOOO paragraph is drafted, but references [42]-[44] remain placeholders and absent from the reference list. |
| A5 | Rebuild §V-G limitations with still-valid v3 limitations | CLOSED | Done at lines 119-127. |
| A6 | Replace "published box rule" | CLOSED | Done. |
| A7 | Narrow full-dataset language | CLOSED | Done at lines 55, 105, and 135. |
| A8 | Strip internal notes/checklists before Phase 5 | OPEN | Draft note and close-out checklist remain (lines 3, 141-150); §III/§IV also retain internal notes/checklists. |
## Newly introduced issues
1. **Minor - §II citation-number gap and placeholder contradiction.** The v2 draft note says §II now has "a real citation," but line 65 says [42]-[44] are placeholders, line 147 still says `[add citation]`, and `paper/paper_a_references_v3.md` stops at [41]. This is the only remaining reviewer-visible blocker if the prose is packaged as manuscript text.
2. **Minor - stale close-out metadata.** The close-out checklist says the abstract is "approximately 235 words" (line 145), but `wc -w` returns 243 words on the abstract paragraph. The author's "244 words" note and the shell count differ by one tokenization unit; both satisfy IEEE Access, but the checklist should be updated or removed.
No newly introduced empirical inconsistency was found.
## Abstract word count verification + key v2 spot checks
Abstract count: `sed -n '11p' paper/v4/paper_a_prose_v4_phase4.md | wc -w` returns **243**. The abstract is one paragraph and under the 250-word IEEE Access target.
Spot-check 1: **Independent-score correction closed.** Lines 11, 35, 95, and 135 now say the scores are feature-derived / shared-input / not statistically independent. This matches §III-K's caveat and §IV-F's framing that the correlations are internal consistency, not external validation.
Spot-check 2: **Big-4 scope and full-dataset correction closed.** Lines 31, 47, 79, 105, and 135 now match §III-G/I and §IV-D/K: Big-4 is the smallest scope among tested comparison scopes; B/C/D single-firm dip tests and full-dataset dip tests were not run; full-dataset evidence is only the light K=3 + box-rule Spearman re-run at `n = 686`.
Spot-check 3: **Operational-vs-descriptive framing closed except line 99 wording.** Lines 11, 33, 55, 111, 133, and 135 reserve operational status for the inherited five-way classifier and keep K=3 descriptive. The only remaining wording leak is line 99's "candidate classifiers."
## Phase 5 readiness
Partial.
Substantively, §III + §IV + Phase 4 prose are converged. Phase 5 should not require new statistical work. It does require one copy-edit/reference pass before packaging: finalize §II citations and references, strip internal notes/checklists, and replace the residual "candidate classifiers" phrase.
## Recommended next-step actions
1. Replace line 99's "all three candidate classifiers" with "all three candidate checks" or "all three candidate scores/rules"; keep K=3 explicitly descriptive.
2. Finalize §II packaging: either splice the full v3.20.0 Related Work body plus the v4 LOOO paragraph into the master, or make this Phase 4 file contain the full §II block. Add real [42]-[44] reference entries and remove the "placeholders" sentence.
3. Strip the Phase 4 draft note and close-out checklist before manuscript assembly; do the same for §III/§IV internal notes and working checklists.
4. Update or remove the stale abstract-count note. The verified shell count is 243 words.
5. After the reference/cross-reference cleanup, run one final manuscript-level lint for unresolved placeholders, duplicate reference numbers, internal notes, and stale section/table references.
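Action 5's manuscript-level lint can start as a simple pattern scan; a minimal sketch (the draft fragment and patterns are illustrative, not the actual manuscript files):

```python
import re

# Hypothetical manuscript fragment containing the kinds of residue
# this review flags: unresolved placeholders and internal notes.
draft = """Intro paragraph.
See [add citation].
Draft note (internal, remove before submission).
Cross-validation discussion with resolved references."""

# Patterns for unresolved placeholders, internal notes, and TODO markers.
lint = re.compile(r"\[add citation\]|remove before submission|TODO")

hits = [(i + 1, line) for i, line in enumerate(draft.splitlines())
        if lint.search(line)]
for lineno, line in hits:
    print(f"line {lineno}: {line}")
```

Running a pass like this over the assembled manuscript (extended with patterns for duplicate reference numbers and stale section labels) would catch the §II placeholder and checklist residue before packaging.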
@@ -1,23 +1,12 @@
# Section III. Methodology — v4.0 Draft v6 (post codex rounds 21–25)
# Section III. Methodology — v4.0 Draft v7 (post codex rounds 21–34)
> **Draft note (2026-05-12, v6; internal — remove before submission).** This file replaces the §III-G through §III-L block of `paper/paper_a_methodology_v3.md` (v3.20.0). Sub-sections III-A through III-F (Pipeline / Data Collection / Page Identification / Detection / Feature Extraction / Dual-Method Descriptors) are unchanged from v3.20.0 and not reproduced here.
>
> **Draft note (2026-05-13, v7; internal — remove before submission).** This file replaces the §III-G through §III-M block of `paper/paper_a_methodology_v3.md` (v3.20.0). Sub-sections III-A through III-F (Pipeline / Data Collection / Page Identification / Detection / Feature Extraction / Dual-Method Descriptors) are unchanged from v3.20.0 and not reproduced here. The §III-G through §III-M block has been substantially restructured between v6 and v7 (2026-05-13): codex round-29 demolished the distributional path to thresholds (Scripts 39b–39e prove (cos, dHash) multimodality is composition + integer artefact); v4.0 pivots to anchor-based multi-level inter-CPA coincidence-rate calibration (Scripts 40b, 43, 44, 45, 46); §III-I is rewritten as the no-natural-threshold diagnostic; §III-J is recast as a firm-compositional descriptive partition (not three mechanism clusters); §III-L is a new major sub-section on anchor-based threshold calibration; §III-M is a new sub-section on validation strategy and limitations under the unsupervised setting. Prior internal draft notes (v2–v6 changelog) have been moved to `paper/v4/CHANGELOG.md`.
>
> **v2** incorporated codex gpt-5.5 round-21 review (`paper/codex_review_gpt55_v4_round1.md`, Major Revision); key revisions were: (i) the inherited five-way per-signature box rule restored as the **primary operational classifier** (§III-L), (ii) the K=3 Gaussian mixture positioned as **accountant-level descriptive characterisation** (§III-J), (iii) "convergent validation" softened to "convergent internal-consistency checks" since the three scores share underlying features (§III-K), (iv) the pixel-identity metric renamed from FAR to positive-anchor miss rate (§III-K), (v) five empirical/wording slips corrected.
>
> **v3** incorporates codex gpt-5.5 round-22 review (`paper/codex_review_gpt55_v4_round2.md`, Minor Revision); five narrow fixes applied: per-firm ranking corrected (Score 2 reverse-anchor ranks Firm D fractionally above Firm C while Scores 1 and 3 rank Firm C highest), "smallest scope" language narrowed to "comparison scopes tested in Script 32", §III-L Spearman correlations explicitly tied to the K=3 *posterior* P(C1), provenance for $n = 150{,}442$ cites Script 39 directly, "max fold-to-fold deviation" wording made precise ($0.028$ = max absolute deviation from across-fold mean; pairwise range $0.0376$).
>
> **v4** incorporates the §III ↔ §IV cross-reference cleanup that codex round-23 review flagged: §III-G unit references now point to actual §IV locations (§IV-J for five-way per-signature counts; §IV-I for inherited inter-CPA FAR), §III-G scope statement enumerates v4-new vs inherited sub-sections explicitly, §III-K cites v3.20.0 Tables IX/XI/XII/XII-B for moderate-band capture-rate (was "§IV-F" which is now Convergent Internal-Consistency), and §III-L's "without recalibration" claim is narrowed to apply only to the binary high-confidence sub-rule.
>
> **v5** incorporates codex gpt-5.5 round-24 review (`paper/codex_review_gpt55_v4_round4.md`, Minor Revision); seven narrow §III-side cleanups: (1) anonymisation leak repaired (real firm names/aliases removed from §III prose; Firm A–D used throughout); (2) K=3 LOOO weight-drift value $0.025$ corrected to $0.023$ at three §III sites (matches Script 37); (3) §III-K positive-anchor paragraph cross-ref repaired (now points to §IV-I and v3.20.0 §IV-F.1 Table X, was the meaningless "§III-J inherited; Table X"); (4) §III-L five-way rule's Likely-hand-signed band made inclusive ($\text{cos} \leq 0.837$, matches Script 42); (5) open question 1's location pointer changed from current §IV-F to v3.20.0 Tables IX/XI/XII/XII-B and §IV-J descriptive proportions; (6) provenance row added for the full-dataset $n = 686$ claim citing Script 41; (7) draft-note dates and version stamps refreshed.
>
> **v6** incorporates codex gpt-5.5 round-25 review (`paper/codex_review_gpt55_v4_round5.md`, Minor Revision): empirical anchor range updated to Scripts 32–42 (was 32–40, missed Scripts 41 and 42).
>
> Empirical anchors throughout reference Scripts 32–42 on branch `paper-a-v4-big4`; a provenance table appears at the end of this section listing every numerical claim with its script and report path.
>
> Empirical anchors throughout reference Scripts 32–46 on branch `paper-a-v4-big4`; a curated provenance table appears at the end of this section listing the principal numerical claims with their script and report path.
## G. Unit of Analysis and Scope
We analyse signatures at two units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-L) and of the signature-level analyses in §IV (notably §IV-J for the five-way per-signature category counts and v3.20.0's inherited inter-CPA FAR analysis referenced in §IV-I). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), of per-CPA internal-consistency analysis (§III-K), and of the leave-one-firm-out reproducibility check (§III-K). At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses.
We analyse signatures at two units of resolution. The **signature** — one signature image extracted from one report — is the operational unit of classification (§III-L) and of the signature-level analyses in §IV (notably §IV-J for the five-way per-signature category counts and the inherited inter-CPA negative-anchor coincidence-rate analysis referenced in §IV-I; reported under prior "FAR" terminology in v3.x). The **accountant** — one CPA aggregated over all of their signatures in the corpus — is the unit of mixture-model characterisation (§III-J), of per-CPA internal-consistency analysis (§III-K), and of the leave-one-firm-out reproducibility check (§III-K). At the accountant level we compute, for each CPA with $n_{\text{sig}} \geq 10$ signatures, the per-CPA mean of the per-signature best-match cosine ($\overline{\text{cos}}_a$) and the per-CPA mean of the independent-minimum dHash ($\overline{\text{dHash}}_a$). The minimum threshold of 10 signatures per CPA is required for the per-CPA mean to be a stable summary; CPAs below this threshold are excluded from the accountant-level analyses but remain in the per-signature analyses.
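The accountant-level aggregation described above can be sketched as follows. This is a minimal sketch, not Script 36/38 itself; the column names `cpa_id`, `cos`, and `dhash` are hypothetical stand-ins for the stored per-signature fields `max_similarity_to_same_accountant` and `min_dhash_independent`:

```python
import pandas as pd

def accountant_level_means(sig: pd.DataFrame, min_sig: int = 10) -> pd.DataFrame:
    """Aggregate per-signature descriptors to per-CPA means.

    Expects one row per signature with (hypothetical) columns:
      cpa_id, cos (best-match cosine), dhash (independent-minimum dHash).
    CPAs with fewer than `min_sig` signatures are dropped from the
    accountant-level table but remain in per-signature analyses.
    """
    g = sig.groupby("cpa_id").agg(
        n_sig=("cos", "size"),
        mean_cos=("cos", "mean"),
        mean_dhash=("dhash", "mean"),
    )
    return g[g["n_sig"] >= min_sig]
```

The `n_sig` column is retained so the exclusion rule stays auditable alongside the means.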
We make no within-year or across-year uniformity assumption about CPA signing mechanisms. Per-signature labels are signature-level quantities throughout this paper; we do not translate them to per-report or per-partner mechanism assignments, and we abstain from partner-level frequency inferences (such as "X% of CPAs hand-sign") that would require such a translation. A CPA's per-CPA mean is a *summary statistic* of their observed signatures, not a claim that all of their signatures share a single mechanism.
@@ -27,73 +16,107 @@ We adopt one stipulation about same-CPA pair detectability:
A1 is plausible for high-volume stamping or firm-level electronic signing workflows but is not guaranteed when (i) the corpus contains only one observed replicated report for a CPA, (ii) multiple template variants are used in parallel, or (iii) scan-stage noise pushes a replicated pair outside the detection regime. A1 is the only assumption the per-signature detector requires to be sensitive to replication.
**Scope: the Big-4 sub-corpus.** v4.0's primary analyses (§III-I, §III-J, §III-K, and the v4-new analyses in §IV-D through §IV-J) are restricted to the four largest accounting firms in Taiwan, pseudonymously labelled Firm A through Firm D throughout the manuscript. §IV-A through §IV-C, §IV-I (inter-CPA negative-anchor FAR), and §IV-L (feature-backbone ablation) report inherited corpus-wide v3.x material that v4.0 does not re-scope to Big-4. §IV-K reports a deliberately narrow full-dataset cross-check at $n = 686$ CPAs. The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$ (Scripts 36, 38), totalling 150,442 Big-4 signatures with both descriptors available (Script 39 reports the explicit per-signature $n$ used in the signature-level K=3 fit). Restricting the v4-new analyses to Big-4 is a methodological choice driven by four considerations:
**Scope: the Big-4 sub-corpus.** v4.0's primary analyses (§III-I, §III-J, §III-K, §III-L, and the v4-new analyses in §IV-D through §IV-J) are restricted to the four largest accounting firms in Taiwan, pseudonymously labelled Firm A through Firm D throughout the manuscript. §IV-A through §IV-C, §IV-I (inter-CPA negative-anchor coincidence rate), and §IV-L (feature-backbone ablation) report inherited corpus-wide v3.x material that v4.0 does not re-scope to Big-4. §IV-K reports a deliberately narrow full-dataset cross-check at $n = 686$ CPAs. The Big-4 sub-corpus comprises 437 CPAs (171 / 112 / 102 / 52 across Firms A through D) with $n_{\text{sig}} \geq 10$ — the threshold for accountant-level analyses (Scripts 36, 38) — totalling 150,442 Big-4 signatures with both pre-computed descriptors available. Restricting the v4-new analyses to Big-4 is a methodological choice driven by four considerations:
1. **Within-pool homogeneity for mixture characterisation.** Pooling Big-4 with mid- and small-firm CPAs introduces a heterogeneous tail of $\sim$249 CPAs distributed across multiple firms with idiosyncratic signing practices and small per-firm samples. The full-sample and Big-4-only calibrations *differ* in their fitted marginal crossings (full-sample published $\overline{\text{cos}}^* = 0.945$, $\overline{\text{dHash}}^* = 8.10$ from v3.x; Big-4-only $\overline{\text{cos}}^* = 0.975$, $\overline{\text{dHash}}^* = 3.76$ from Script 34; bootstrap 95% CIs $[0.974, 0.977]$ / $[3.48, 3.97]$, $n_{\text{boot}} = 500$); the offset is large compared to the Big-4 bootstrap CI half-width of $0.0015$. We report this as a *scope-dependent shift* rather than asserting a causal "mid/small-firm tail distorts" claim.
1. **Leave-one-firm-out fold feasibility.** §III-K reports leave-one-firm-out (LOOO) cross-validation of the Big-4 K=3 fit. The Big-4 sub-corpus permits a four-fold LOOO at the firm level (one fold per Big-4 firm). No analogous firm-level fold is available outside Big-4 because mid/small firms have CPA counts of $O(1)$–$O(30)$ per firm.
2. **Statistical multimodality at the accountant level.** Within the Big-4 sub-corpus, the Hartigan dip test rejects unimodality on both axes (cosine $p = 0.0000$, dHash $p = 0.0000$ in the bootstrap-2000 implementation, i.e., no bootstrap replicate exceeded the observed statistic; reported here as $p < 5 \times 10^{-4}$; Script 34). No such rejection holds in the comparison scopes tested by Script 32: Firm A pooled alone ($p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$, $n = 171$); Firms B + C + D pooled (Script 32, `big4_non_A` subset: $p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.906$); all non-Firm-A pooled (Script 32, `all_non_A` subset: other Big-4 plus mid/small firms; $p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.907$). Among the comparison scopes we evaluated, Big-4 is the smallest scope at which the dip test supports applying a finite-mixture model to the per-CPA distribution; we did not separately test single-firm dip statistics for Firms B, C, or D.
2. **Firm A as templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane (§III-J K=3 component cross-tab; v3.x byte-level pair analysis referenced in §III-H). v4.0 retains Firm A within the Big-4 scope as a descriptive case study of the templated end, rather than treating Firm A as the calibration anchor for thresholds (the v3.x role of Firm A).
3. **Reproducibility under leave-one-firm-out cross-validation.** §III-K reports leave-one-firm-out (LOOO) cross-validation of the Big-4 mixture fit. The Big-4 sub-corpus permits a four-fold LOOO at the firm level (one fold per Big-4 firm). No analogous firm-level fold is available outside Big-4 because mid/small firms have CPA counts of $O(1)$–$O(30)$ per firm.
3. **Within-firm cross-CPA collision structure analysis.** §III-L.4 reports a Big-4 cross-firm hit-matrix analysis (Script 44) that quantifies the within-firm cross-CPA template-like collision pattern. The four-firm setting affords the cleanest signal for this analysis; replicating the same matrix structure on the heterogeneous mid/small-firm tail is left as future work.
4. **Restricted generalizability claim.** v4.0's primary claims are scoped to the Big-4 audit-report context; we do not assert that the same mixture structure or operational thresholds extend to mid/small firms. The 249 non-Big-4 CPAs enter only (a) as an external reference population in §III-H's reverse-anchor internal-consistency check and (b) as a robustness comparison in §IV-K. Generalisation beyond Big-4 is left as future work.
4. **Restricted generalisability claim.** v4.0's primary claims are scoped to the Big-4 audit-report context; we do not assert that the same descriptive mixture structure or operational alert behaviour extends to mid/small firms. The 249 non-Big-4 CPAs enter only (a) as an external reference population in §III-H's reverse-anchor internal-consistency check, (b) as a robustness comparison in §IV-K, and (c) as a corroborating-population check on the dHash discrete-mass-point artefact in §III-I.4 (Script 39c). Generalisation beyond Big-4 is left as future work.
We earlier (v4.0 first draft) listed "statistical multimodality at the accountant level" among the scope justifications, on the basis that the Hartigan dip test rejects unimodality on the Big-4 accountant-level marginals. §III-I.4 reports diagnostics (Scripts 39b–39e) that explain the rejection as a joint effect of between-firm composition shift and dHash integer mass points, not as evidence of within-population continuous bimodality. We therefore no longer list dip-test multimodality among the Big-4 scope rationales; the K=3 mixture is retained as a descriptive partition (§III-J), not as inferential evidence for two mechanism modes.
**Sample-size reconciliation.** Two Big-4 signature counts appear in this section and §IV: $n = 150{,}442$ for analyses using the pre-computed per-signature descriptors $\text{cos}_s$ (`max_similarity_to_same_accountant`) and $\text{dHash}_s$ (`min_dhash_independent`), and $n = 150{,}453$ for analyses recomputing pair-level metrics directly from the stored feature and dHash byte vectors (Scripts 40b, 43, 44). The $11$-signature difference reflects descriptor-completion status: $11$ signatures have feature vectors and dHash byte vectors stored but lack the pre-computed extrema. The $11$ signatures are negligible at population scale and do not affect any reported coincidence rate within $0.01$ percentage point. The CPA counts $468$ (all Big-4 CPAs with both vectors stored) and $437$ (Big-4 CPAs with $n_{\text{sig}} \geq 10$ for accountant-level stability) likewise reflect a single uniform exclusion rule rather than analysis-specific subsetting.
## H. Reference Populations
v4.0 distinguishes two reference populations in its calibration, replacing v3.x's single-anchor framing.
**Internal reference: Firm A as the templated-end case study.** Firm A is empirically the most digitally-replicated of the Big-4. In the Big-4 K=3 mixture (§III-J; Scripts 35, 38), Firm A accounts for 0% of the C1 hand-leaning component (cos $\approx 0.946$, dHash $\approx 9.17$, weight $\approx 0.143$), 17.5% of the C2 mixed component, and 82.5% of the C3 replicated component; the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3, hereafter referred to as "the Firm whose CPAs are most concentrated in C1"). The byte-level pair analysis reported in v3.x §IV-F.1 identifies 145 Firm A pixel-identical signatures at the signature level (Script 40 verifies the 145/262 split among Big-4 pixel-identical signatures); the additional details that v3.x attributes to this analysis (50 distinct Firm A partners of 180 registered; 35 byte-identical matches spanning different fiscal years) are inherited from the Script 28 / Appendix B byte-decomposition output and were not regenerated in the v4.0 spike scripts. We retain those v3.x details by reference and mark them in the provenance table as "inherited from v3 §IV-F.1 / Script 28."
**Internal reference: Firm A as the templated-end case study.** Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 descriptive partition (§III-J; Scripts 35, 38), Firm A accounts for 0% of the C1 component (low-cos / high-dHash corner; cos $\approx 0.946$, dHash $\approx 9.17$, weight $\approx 0.143$), 17.5% of the C2 component (central region), and 82.5% of the C3 component (high-cos / low-dHash corner); the opposite pattern holds at Firm C (Script 35: 23.5% C1, 75.5% C2, 1.0% C3, hereafter referred to as "the Firm whose CPAs are most concentrated in C1"). The byte-level pair analysis reported in v3.x §IV-F.1 identifies 145 Firm A pixel-identical signatures at the signature level (Script 40 verifies the 145/262 split among Big-4 pixel-identical signatures); the additional details that v3.x attributes to this analysis (50 distinct Firm A partners of 180 registered; 35 byte-identical matches spanning different fiscal years) are inherited from the Script 28 / Appendix B byte-decomposition output and were not regenerated in the v4.0 spike scripts. We retain those v3.x details by reference and mark them in the provenance table as "inherited from v3 §IV-F.1 / Script 28."
In v4.0, Firm A is *not* the calibration anchor for the operational threshold. Firm A enters the Big-4 mixture on equal footing with Firms B through D; the K=3 components are derived from the joint Big-4 distribution (§III-J), not from Firm A alone. Firm A's role in the methodology is descriptive: it is the Big-4 firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the descriptor plane, and the byte-level pair evidence above provides the firm-level signature-reuse evidence that anchors §III-K's pixel-identity positive-anchor miss rate.
**External reference: non-Big-4 as the reverse-anchor reference for internal-consistency checking.** The 249 non-Big-4 CPAs ($n_{\text{sig}} \geq 10$, drawn from $\sim$30 mid- and small-firms) constitute a population strictly outside the Big-4 target. Their per-CPA $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ distribution defines a 2D Gaussian reference (fit by Minimum Covariance Determinant with support fraction 0.85 for robustness; Script 38). This reference is used in §III-K's reverse-anchor internal-consistency check: each Big-4 CPA's location relative to the reference centre, measured as the marginal cosine cumulative-distribution-function value under the reference, is one of three feature-derived scores that v4.0 uses as a cross-check on the inherited per-signature classifier. The reverse-anchor reference is *not* a positive or negative anchor for threshold derivation — its role is to provide a strictly out-of-target benchmark against which the within-Big-4 mixture-derived ranking can be internally cross-checked.
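The robust reference fit described above can be sketched with scikit-learn's `MinCovDet`. Only the support fraction (0.85) is stated in the text; all other settings below (including the seed) are illustrative assumptions, not the Script 38 configuration:

```python
import numpy as np
from sklearn.covariance import MinCovDet

def fit_reference(ref_xy: np.ndarray, support_fraction: float = 0.85):
    """Robust 2D Gaussian reference for the non-Big-4 per-CPA cloud.

    ref_xy: (n, 2) array of (mean_cos, mean_dhash), one row per
    reference CPA.  Returns the robust location and covariance from
    the Minimum Covariance Determinant estimator.
    """
    mcd = MinCovDet(support_fraction=support_fraction,
                    random_state=0).fit(ref_xy)
    return mcd.location_, mcd.covariance_
```

MCD's appeal here is that a small clump of atypical reference CPAs (e.g. a templated-corner outlier group) barely moves the fitted centre, unlike a plain sample mean.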
The reverse-anchor reference centre is at $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$ (Script 38). The reference sits at a lower cosine and higher dHash than the Big-4 K=3 hand-leaning component (cos $= 0.946$, dHash $= 9.17$; §III-J); compared to the Big-4 templated component (cos $= 0.983$, dHash $= 2.41$; §III-J) the reference is markedly less replication-dominated. The reverse-anchor metric for a given Big-4 CPA is the percentile of $\overline{\text{cos}}_a$ within the reference marginal cosine distribution, sign-flipped so that lower percentile (further into the left tail of the reference) corresponds to a more hand-leaning Big-4 CPA. This is a "deviation in the hand-leaning direction" measure, not a "deviation toward replication" measure; the reference is the less-replicated population.
The reverse-anchor reference centre is at $\overline{\text{cos}} = 0.935$, $\overline{\text{dHash}} = 9.77$ (Script 38). The reference sits at a lower cosine and higher dHash than the Big-4 K=3 low-cos / high-dHash component (cos $= 0.946$, dHash $= 9.17$; §III-J); compared to the Big-4 high-cos / low-dHash component (cos $= 0.983$, dHash $= 2.41$; §III-J) the reference is markedly less replication-dominated. The reverse-anchor metric for a given Big-4 CPA is the percentile of $\overline{\text{cos}}_a$ within the reference marginal cosine distribution, sign-flipped so that lower percentile (further into the left tail of the reference) corresponds to a Big-4 CPA whose mean cosine sits further from the templated end of the descriptor plane. This is a "deviation in the less-replication-dominated descriptor-position direction" measure, not a "deviation toward the templated descriptor-position" measure; the reference is the less-replication-dominated population.
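The reverse-anchor metric reduces to an empirical-CDF evaluation of a Big-4 CPA's mean cosine against the reference marginal. A minimal sketch, assuming the plain empirical CDF (the text does not specify an interpolation rule); lower values mean further into the reference's left tail:

```python
import numpy as np

def reverse_anchor_score(cpa_mean_cos: float,
                         ref_mean_cos: np.ndarray) -> float:
    """Empirical-CDF value of one Big-4 CPA's mean cosine within the
    non-Big-4 reference marginal.  Lower score = further into the
    left tail of the reference, i.e. further from the templated end
    of the descriptor plane."""
    return float(np.mean(np.asarray(ref_mean_cos) <= cpa_mean_cos))
```

Rescaling to percentile points (0-100) or flipping the sign is cosmetic; only the ordering of CPAs matters for the internal-consistency cross-check.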
## I. Distributional Characterisation at the Accountant Level
## I. Distributional Diagnostics: Why the Composition Path Does Not Yield a Natural Threshold
This section characterises the joint distribution of accountant-level descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ across the 437 Big-4 CPAs of §III-G. Three diagnostic procedures are applied: a univariate unimodality test on each marginal axis, a 2D Gaussian mixture fit (developed in §III-J), and a density-smoothness diagnostic.
This section characterises the joint distribution of accountant-level descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ across the 437 Big-4 CPAs of §III-G and tests whether the distribution provides distributional support — in the form of within-population bimodality — for the operational thresholds inherited from v3.x. We apply four diagnostic procedures in turn: a univariate unimodality test on each accountant-level marginal; a 2D Gaussian mixture fit (developed in §III-J); a density-smoothness diagnostic; and a composition decomposition that distinguishes within-population multimodality from between-firm location-shift artefacts (the v4-new diagnostic battery). The four diagnostics jointly imply that the operational thresholds are *not* anchored by distributional bimodality: §III-L develops an anchor-based calibration framework that does not require this assumption.
**1. Hartigan dip test on each marginal.** We apply the Hartigan & Hartigan dip test [37] to each of the two marginal distributions $\{\overline{\text{cos}}_a\}_{a=1}^{437}$ and $\{\overline{\text{dHash}}_a\}_{a=1}^{437}$, with bootstrap-based $p$-value estimation ($n_{\text{boot}} = 2000$). In both cases no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by $5 \times 10^{-4}$; we report this in tables as $p < 5 \times 10^{-4}$ rather than $p = 0$ to reflect the bootstrap resolution (Script 34). For comparison, no rejection of unimodality holds in the comparison scopes tested in Script 32: Firm A pooled alone ($p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$, $n = 171$); Firms B + C + D pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.906$, $n = 266$); all non-Firm-A CPAs pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.907$, $n = 515$). Single-firm dip tests for Firms B, C, and D were not separately computed; the comparison scopes above sufficed to establish that no narrower-than-Big-4 *tested* scope rejected unimodality. The dip-test multimodality at the Big-4 level is the empirical justification for fitting a finite-mixture model in §III-J; without it, the mixture would be a forced fit on an essentially unimodal distribution.
**1. Hartigan dip test on each accountant-level marginal.** We apply the Hartigan & Hartigan dip test [37] to each of the two marginal distributions $\{\overline{\text{cos}}_a\}_{a=1}^{437}$ and $\{\overline{\text{dHash}}_a\}_{a=1}^{437}$, with bootstrap-based $p$-value estimation ($n_{\text{boot}} = 2000$). In both cases no bootstrap replicate exceeded the observed dip statistic, so the empirical $p$-value is bounded above by $5 \times 10^{-4}$; we report this in tables as $p < 5 \times 10^{-4}$ rather than $p = 0$ to reflect the bootstrap resolution (Script 34). For comparison, no rejection of unimodality holds in the comparison scopes tested in Script 32: Firm A pooled alone ($p_{\text{cos}} = 0.992$, $p_{\text{dHash}} = 0.924$, $n = 171$); Firms B + C + D pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.906$, $n = 266$); all non-Firm-A CPAs pooled ($p_{\text{cos}} = 0.998$, $p_{\text{dHash}} = 0.907$, $n = 515$). Single-firm dip tests for Firms B, C, and D were not separately computed; the comparison scopes above sufficed to establish that no narrower-than-Big-4 *tested* scope at the accountant level rejected unimodality. The accountant-level Big-4 rejection is a descriptive observation; §III-I.4 below shows that the rejection is fully explained by between-firm location-shift effects rather than within-population bimodality.
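The bootstrap reporting convention ("no replicate reached the observed statistic, so the empirical $p$-value is bounded by the bootstrap resolution $1/n_{\text{boot}}$") can be sketched generically. The dip statistic itself is not reproduced here, so `stat` is an arbitrary callable and the uniform null sampler is purely illustrative:

```python
import numpy as np

def bootstrap_p_upper(x, stat, null_sampler, n_boot=2000, seed=42):
    """Bootstrap p-value with a resolution-bound convention: when no
    null replicate reaches the observed statistic, return 1/n_boot
    (to be reported as 'p < 1/n_boot', never 'p = 0').  In Script 34
    `stat` would be the Hartigan dip statistic; any callable works."""
    rng = np.random.default_rng(seed)
    obs = stat(np.asarray(x))
    exceed = sum(stat(null_sampler(rng, len(x))) >= obs
                 for _ in range(n_boot))
    return exceed / n_boot if exceed > 0 else 1.0 / n_boot
```

With $n_{\text{boot}} = 2000$ the resolution bound is exactly the $5 \times 10^{-4}$ reported in the text.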
**2. Mixture-model evidence.** A 2-component 2D Gaussian Mixture Model (full covariance, $n_{\text{init}} = 15$, fixed seed 42; Script 34) recovers components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$, weight $0.689$, and $(0.983, 2.41)$, weight $0.311$. The marginal crossings of the K=2 fit are $\overline{\text{cos}}^* = 0.9755$ and $\overline{\text{dHash}}^* = 3.755$, with bootstrap 95% confidence intervals $[0.9742, 0.9772]$ and $[3.48, 3.97]$ over $n_{\text{boot}} = 500$ resamples. The 3-component fit (§III-J) is BIC-preferred — using the convention that lower BIC is preferred, $\text{BIC}(K{=}3) - \text{BIC}(K{=}2) = -3.48$ (Script 36). The $\Delta$BIC magnitude is small in absolute terms; we do not treat $\Delta\text{BIC} = 3.5$ alone as decisive evidence for K=3, and the operational role of each fit is developed in §III-J and §III-K.
**2. K=2 / K=3 Gaussian mixture fits (descriptive partition).** A 2-component 2D Gaussian Mixture Model (full covariance, $n_{\text{init}} = 15$, fixed seed 42; Script 34) recovers components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$, weight $0.689$, and $(0.983, 2.41)$, weight $0.311$. The marginal crossings of the K=2 fit are $\overline{\text{cos}}^* = 0.9755$ and $\overline{\text{dHash}}^* = 3.755$, with bootstrap 95% confidence intervals $[0.9742, 0.9772]$ and $[3.48, 3.97]$ over $n_{\text{boot}} = 500$ resamples. The 3-component fit (§III-J) is BIC-preferred — using the convention that lower BIC is preferred, $\text{BIC}(K{=}3) - \text{BIC}(K{=}2) = -3.48$ (Script 36). The $\Delta$BIC magnitude is small in absolute terms; we do not treat $\Delta\text{BIC} = 3.5$ alone as decisive evidence for K=3 as a population mixture. Following §III-I.4 we treat both K=2 and K=3 fits as *descriptive partitions* of the joint Big-4 distribution that reflect firm-composition structure (Firm A vs others; §III-J) rather than as inferential evidence for two or three latent population modes.
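The K=2 versus K=3 comparison can be sketched with scikit-learn's `GaussianMixture`, matching the stated settings (full covariance, $n_{\text{init}} = 15$, fixed seed 42); this is a sketch of the procedure, not Script 34/36 itself, and the $\Delta$BIC sign convention follows the text (lower BIC preferred):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_and_compare(xy: np.ndarray, seed: int = 42, n_init: int = 15):
    """Fit K=2 and K=3 full-covariance GMMs to the per-CPA
    (mean_cos, mean_dhash) cloud and return the fits plus
    delta-BIC = BIC(K=3) - BIC(K=2); negative => K=3 preferred."""
    fits = {k: GaussianMixture(n_components=k, covariance_type="full",
                               n_init=n_init, random_state=seed).fit(xy)
            for k in (2, 3)}
    return fits, fits[3].bic(xy) - fits[2].bic(xy)
```

A small $|\Delta\text{BIC}|$, as in the reported $-3.48$, is exactly the situation where the text declines to treat BIC alone as decisive.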
**3. Burgstahler-Dichev / McCrary density-smoothness diagnostic.** We apply the discontinuity test of [38, 39] as a *density-smoothness diagnostic* (rather than as a threshold estimator) on each marginal axis (cosine in bins of $0.002$, dHash in integer bins). At the Big-4 scope, the diagnostic identifies no significant transition on either marginal at $\alpha = 0.05$ (Script 34). Outside Big-4, the diagnostic does flag dHash transitions in some subsets (Script 32: `big4_non_A` dHash transition at $10.8$; `all_non_A` dHash transition at $6.6$; pre-2018 and post-2020 time-stratified variants also exhibit one or more dHash transitions), but no cosine transition is identified in any subset. The Big-4-scope null on both axes is consistent with the mixture-model evidence: the K=3 components overlap rather than separate sharply, so a local-discontinuity test does not flag a transition. We retain BD/McCrary in v4.0 as a non-parametric robustness diagnostic; the dHash transitions outside Big-4 are not used as operational thresholds because they are scope-dependent and lie within rather than between modes of the corresponding density.
**3. Burgstahler-Dichev / McCrary density-smoothness diagnostic.** We apply the discontinuity test of [38, 39] as a *density-smoothness diagnostic* (rather than as a threshold estimator) on each accountant-level marginal axis (cosine in bins of $0.002$, dHash in integer bins). At the Big-4 scope, the diagnostic identifies no significant transition on either marginal at $\alpha = 0.05$ (Script 34). Outside Big-4, the diagnostic does flag dHash transitions in some subsets (Script 32: `big4_non_A` dHash transition at $10.8$; `all_non_A` dHash transition at $6.6$; pre-2018 and post-2020 time-stratified variants also exhibit one or more dHash transitions), but no cosine transition is identified in any subset. The Big-4-scope null on both axes is consistent with §III-I.4 below: under the composition decomposition the Big-4 marginals are unimodal once between-firm and integer-tie confounds are removed, so a local-discontinuity test correctly fails to flag a within-population transition.
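A standardized-difference form of the Burgstahler-Dichev smoothness diagnostic can be sketched as follows. This is one common variant, comparing each interior bin count to the mean of its neighbours under an approximate multinomial variance; the exact statistic and variance form used by Scripts 32/34 are not stated in the text and may differ:

```python
import numpy as np

def standardized_differences(counts) -> np.ndarray:
    """Illustrative Burgstahler-Dichev-style diagnostic: for each
    interior histogram bin, the standardized difference between its
    count and the average of its two neighbours.  Large |z| flags a
    candidate density discontinuity; a smooth density yields small z
    everywhere."""
    n = np.asarray(counts, dtype=float)
    N = n.sum()
    p = n / N
    mid = (n[:-2] + n[2:]) / 2.0  # neighbour average per interior bin
    var = (N * p[1:-1] * (1 - p[1:-1])
           + 0.25 * N * (p[:-2] + p[2:]) * (1 - p[:-2] - p[2:]))
    return (n[1:-1] - mid) / np.sqrt(np.maximum(var, 1e-12))
```

Used as a diagnostic rather than a threshold estimator, only the presence or absence of a significant transition matters, matching how the text interprets the Big-4-scope null.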
## J. Mixture Model and Accountant-Level Characterisation
**4. Composition decomposition (Scripts 39b–39e).** §III-I.1 establishes that the accountant-level marginals reject unimodality at the Big-4 sub-corpus. The remaining question is whether the rejection reflects (a) genuine within-population bimodality at the signature or accountant level, (b) between-firm location-shift artefacts (firms with different mean descriptor positions pool to a multi-peaked distribution), or (c) integer mass-point artefacts on the integer-valued dHash axis (the dHash dip statistic is sensitive to spikes at integer values). We apply four diagnostics that decompose the rejection into these candidate sources:
This section develops the K=2 and K=3 Gaussian mixture fits to the Big-4 accountant-level distribution and clarifies their role. **Both fits are descriptive characterisations of the joint Big-4 distribution; the operational per-signature classifier remains the inherited five-way box rule of §III-L.** Neither mixture is used to assign signature-level or document-level labels in the v4.0 primary analysis.
*Within-firm signature-level dip (Scripts 39b, 39c).* Repeating the dip test at the signature level inside each individual Big-4 firm (Script 39b) and inside each individual non-Big-4 firm with $\geq 500$ signatures (Script 39c) yields a consistent picture. The cosine marginal *fails* to reject unimodality in every single firm tested — all four Big-4 firms ($p_{\text{cos}} \in \{0.176, 0.991, 0.551, 0.976\}$ for Firms A through D; Script 39b) and ten non-Big-4 firms with $\geq 500$ signatures ($p_{\text{cos}} \in [0.59, 0.99]$; Script 39c). The raw dHash marginal *does* reject unimodality in every firm tested ($p < 5 \times 10^{-4}$ in all $14$ firms), but the raw dHash values are integer-valued in $\{0, 1, \ldots, 64\}$, leaving open the possibility of an integer-tie artefact.
*Integer-jitter robustness (Scripts 39d, 39e).* Adding independent uniform jitter $\sim \mathrm{U}[-0.5, +0.5]$ to break exact dHash ties and re-running the dip test on the perturbed signature cloud (5 seeds, $n_{\text{boot}} = 2000$; Script 39d) eliminates the dHash within-firm rejection in every Big-4 firm tested (Firm A jittered $p_{\text{median}} = 0.999$; B $0.996$; C $0.999$; D $0.9995$; $0$/$5$ seeds reject at $\alpha = 0.05$ in any firm). All ten non-Big-4 firms similarly fail to reject after jitter ($p \in [0.71, 1.00]$). The pooled-Big-4 dHash dip *does* survive jitter alone ($p_{\text{median}} = 0$, $5$/$5$ seeds reject), but Firm A's mean dHash ($2.73$) is substantially below Firms B/C/D's ($6.46$, $7.39$, $7.21$), a between-firm location shift. Script 39e applies a $2 \times 2$ factorial correction (firm-mean centring $\times$ integer jitter) on the Big-4 pooled dHash:
| Condition | Firm-mean centred | Integer jitter | Median dip $p$ | Reject at $\alpha = 0.05$ |
|---|---|---|---|---|
| 1 raw | — | — | $< 5 \times 10^{-4}$ | $5/5$ |
| 2 centred only | $\checkmark$ | — | $< 5 \times 10^{-4}$ | $5/5$ |
| 3 jittered only | — | $\checkmark$ | $< 5 \times 10^{-4}$ | $5/5$ |
| 4 centred and jittered | $\checkmark$ | $\checkmark$ | $\mathbf{0.35}$ | $\mathbf{0/5}$ |
Removing *both* the between-firm location shift *and* the integer mass points eliminates the Big-4 dHash rejection. The Big-4 pooled dHash multimodality is therefore fully attributable to firm-composition contrast (primarily Firm A's mean $\text{dHash} = 2.73$ versus Firms B/C/D $\approx 6.5$–$7.4$) and integer-density artefacts, with no residual continuous within-firm bimodality.
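The factorial correction itself is a simple data transformation; a minimal sketch follows (a hypothetical helper, not Script 39e's code). Condition 4 of the table corresponds to `centre=True, jitter=True`; the dip test is then run on the returned pooled sample.

```python
import random

def centre_and_jitter(dhash_by_firm, centre=True, jitter=True, seed=0):
    """Apply the 2x2 factorial correction to pooled integer dHash values.

    centre: subtract each firm's mean (removes between-firm location shift).
    jitter: add U[-0.5, +0.5) noise (breaks integer mass points).
    Returns the pooled, transformed sample.
    """
    rng = random.Random(seed)
    pooled = []
    for firm, values in dhash_by_firm.items():
        mean = sum(values) / len(values) if centre else 0.0
        for v in values:
            pooled.append(v - mean + (rng.uniform(-0.5, 0.5) if jitter else 0.0))
    return pooled
```

Centring alone leaves the integer spikes in place, and jitter alone leaves the between-firm location shift in place; only the combination removes both confounds before the dip test is re-run.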
*Cosine analogue.* The cosine axis exhibits the same pattern. Codex-verified read-only spike on the Big-4 pooled signature cloud yields signature-level cosine dip $p < 5 \times 10^{-4}$ on the raw data, but $p = 0.597$ after firm-mean centring; accountant-level cosine $p = 1.0$ after firm-mean centring. The cosine multimodality is therefore between-firm composition-driven, not within-population bimodality.
*Integer-histogram valleys (Script 39d).* A genuine within-firm dHash antimode would appear as a strict local minimum in the count histogram with deep relative depth. Within each of the four Big-4 firms, the dHash histogram on bins $0$–$20$ exhibits no strict local minimum; the Big-4 pooled histogram exhibits one shallow valley at $\text{dHash} = 4$ with relative depth $0.021$ (a $2.1\%$ count drop). No valley near the inherited $\text{dHash} = 5$ operational boundary appears within any individual firm. The hypothesised dHash antimode near $\text{dHash} \approx 5$ is not empirically supported by the histogram analysis.
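The valley criterion can be made concrete with a small helper (illustrative, not Script 39d's implementation). Relative depth here is computed against the lower flanking count, one plausible reading of the report's convention; only consecutive integer bins are compared.

```python
def histogram_valleys(counts):
    """Find strict local minima in an integer-bin count histogram.

    Returns (bin, relative_depth) pairs, where relative depth is the
    count drop at the valley relative to the lower flanking peak.
    """
    bins = sorted(counts)
    valleys = []
    for i in range(1, len(bins) - 1):
        left, mid, right = counts[bins[i - 1]], counts[bins[i]], counts[bins[i + 1]]
        if mid < left and mid < right:
            depth = (min(left, right) - mid) / min(left, right)
            valleys.append((bins[i], depth))
    return valleys
```

A monotone histogram yields no valleys; a shallow dip like the pooled Big-4 one would surface with a small relative depth on the order of a few percent.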
**5. Conclusion: no natural threshold from the descriptor distribution.** §III-I.4 jointly establishes that (a) the Big-4 accountant-level dip rejection is fully attributable to between-firm composition and integer mass-point artefacts; (b) within any individual firm, the descriptor marginals at the signature level are unimodal once integer ties are broken; and (c) no integer-histogram valley near the inherited $\text{dHash} = 5$ operational boundary exists within any firm. The descriptor distributions therefore do not contain a within-population bimodal antimode that could anchor an operational threshold. The K=2 / K=3 mixture fits of §III-I.2 and §III-J are retained as *descriptive partitions* that reflect firm-composition contrast, not as inferential evidence for two or three population modes. §III-L develops the v4.0 anchor-based threshold calibration framework, which derives operational rates from inter-CPA pair-level negative-anchor coincidences rather than from a distributional antimode.
## J. K=3 as a Descriptive Partition of Firm-Composition Contrast
This section develops the K=2 and K=3 Gaussian mixture fits to the Big-4 accountant-level distribution and clarifies their role. **Both fits are descriptive partitions of the joint Big-4 distribution; they reflect firm-composition contrast — primarily Firm A versus Firms B, C, D — rather than within-population mechanism modes.** §III-I.4 demonstrates that the apparent multimodality of the accountant-level marginals is fully explained by between-firm location shifts and integer mass-point artefacts, leaving no residual evidence for two or three latent within-population mechanism classes. Neither mixture is used to assign signature-level or document-level labels in the v4.0 primary analysis. The operational classifier of §III-L is calibrated via inter-CPA negative-anchor coincidence rates, not via mixture-derived antimodes.
**K=2 fit.** Two components at $(\overline{\text{cos}}, \overline{\text{dHash}}) = (0.954, 7.14)$ (weight $0.689$) and $(0.983, 2.41)$ (weight $0.311$) (Script 34). $\text{BIC}(K{=}2) = -1108.45$. Marginal crossings: $\overline{\text{cos}}^* = 0.9755$, $\overline{\text{dHash}}^* = 3.755$. We refer to the components by index rather than by mechanism labels, since §III-I.4 establishes that the K=2 separation is firm-compositional rather than mechanistic.
**K=3 fit.** Three components, sorted by ascending cosine mean (Script 35; Script 38 reproduces):
| Component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight | descriptive position |
|---|---|---|---|---|
| C1 | 0.9457 | 9.17 | 0.143 | low-cos / high-dHash corner |
| C2 | 0.9558 | 6.66 | 0.536 | central region |
| C3 | 0.9826 | 2.41 | 0.321 | high-cos / low-dHash corner |
$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild numerical preference for K=3 under standard BIC interpretation, but not by itself decisive). The "descriptive position" column replaces v3.x's "hand-leaning / mixed / replicated" mechanism labels: §III-I.4 establishes that the cosine and dHash axes both lack within-population bimodality, so component centres are best interpreted as locations in a continuous descriptor space rather than as latent mechanism modes.
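The kind of fit and BIC comparison the mixture scripts perform can be sketched with scikit-learn on synthetic data. The blob locations and spreads below are illustrative stand-ins for the per-CPA $(\overline{\text{cos}}, \overline{\text{dHash}})$ cloud, not the paper's estimates, and the script's actual settings are not assumed.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the per-CPA descriptor cloud: two firm-like
# blobs separated mainly along the dHash axis (values are illustrative).
rng = np.random.default_rng(0)
blob_a = rng.normal([0.98, 2.5], [0.01, 0.8], size=(150, 2))
blob_b = rng.normal([0.955, 7.0], [0.012, 1.5], size=(300, 2))
X = np.vstack([blob_a, blob_b])

# Fit full-covariance mixtures for several K and compare BIC; lower BIC
# is preferred, and a difference of a few units is mild rather than
# decisive evidence, as noted above.
bics = {k: GaussianMixture(n_components=k, covariance_type="full",
                           n_init=5, random_state=0).fit(X).bic(X)
        for k in (1, 2, 3)}
```

On strongly firm-separated synthetic data the multi-component fits clearly beat $K{=}1$; the paper's point is that such a preference reflects the composition contrast, not latent mechanism modes.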
**Per-firm component composition (Script 35 firm × cluster cross-tab).** The K=3 partition is dominated by firm membership:
- Firm A: $0\%$ C1, $17.5\%$ C2, $82.5\%$ C3
- Firm B: $8.9\%$ C1, $\sim 78\%$ C2, $\sim 13\%$ C3
- Firm C: $23.5\%$ C1, $75.5\%$ C2, $1.0\%$ C3
- Firm D: $11.5\%$ C1, $\sim 84\%$ C2, $\sim 4.5\%$ C3
Firm A accounts for $141$ of the $143$ C3-assigned CPAs; Firm C accounts for $24$ of the $40$ C1-assigned CPAs. The K=3 partition is therefore well-described as a firm-compositional decomposition: C3 is essentially "Firm A and any non-Firm-A CPA whose mean descriptors happen to land in the high-cos / low-dHash corner"; C1 is essentially "non-Firm-A CPAs whose mean descriptors land in the low-cos / high-dHash corner." The composition contrast that K=3 captures at the accountant level reappears at the deployment level in the cross-firm hit matrix of §III-L.4 (Script 44): nearly all (98%) of the inter-CPA-anchor hits for a Firm A source signature have a Firm A candidate, and the same within-firm concentration holds for Firms B, C, D individually. The K=3 partition and the cross-firm hit matrix therefore describe the same underlying firm-compositional structure at two different units of analysis.
**Leave-one-firm-out stability (Scripts 36, 37).** Leave-one-firm-out cross-validation shows that K=2 is unstable across folds: holding Firm A out gives a fold rule cos $> 0.938$ AND dHash $\leq 8.79$, while holding any single non-Firm-A Big-4 firm out gives a fold rule near cos $> 0.975$ AND dHash $\leq 3.76$ (Script 36). The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$ (the corresponding pairwise across-fold range is $0.0376$, from $0.9380$ for the held-out-Firm-A fold to $0.9756$ for the held-out-Firm-D fold; Script 36 stability summary). The $0.028$ value is $5.6\times$ the report's $0.005$ across-fold stability tolerance. K=3 in contrast has a *reproducible component shape*: across the four folds the C1 cosine mean varies by at most $0.005$, the C1 dHash mean by at most $0.96$, and the C1 weight by at most $0.023$ (Script 37). K=3 hard-posterior membership for the held-out firm is composition-sensitive — for Firm C the held-out C1 rate is $36.3\%$ vs the full-Big-4 baseline of $23.5\%$, an absolute difference of $12.8$ pp; for Firm A the held-out C1 rate is $4.7\%$ vs baseline $0.0\%$; the report's own legend classifies this pattern as `P2_PARTIAL` ("the C1 cluster exists but membership is not well-predicted by the held-out fit"). We accordingly do not use K=3 hard-posterior membership as an operational label.
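The LOOO protocol reduces to a small refitting loop; a generic sketch follows (the `fit` callable and the data layout are hypothetical, not Script 36/37 code).

```python
def leave_one_firm_out(points_by_firm, fit):
    """Generic leave-one-firm-out loop.

    points_by_firm: {firm: [(cos, dhash), ...]}.
    fit: callable returning a fitted object from a pooled training list.
    Returns {held_out_firm: fitted object trained on the other firms},
    which can then be evaluated on the held-out firm's points.
    """
    folds = {}
    for held_out in points_by_firm:
        train = [p for firm, pts in points_by_firm.items()
                 if firm != held_out for p in pts]
        folds[held_out] = fit(train)
    return folds
```

Fold-to-fold variation of the fitted object (e.g. a threshold crossing or component mean) against a pre-registered tolerance is what the stability summaries above report.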
We take the joint K=2 / K=3 LOOO evidence as supporting the following descriptive claims, all of which are used in §III-K and §V but none of which underwrites the v4.0 operational classifier:
- The Big-4 K=2 marginal crossing $(0.975, 3.76)$ is essentially a firm-mass separator between Firm A and Firms B + C + D, not a within-Big-4 mechanism boundary.
- The Big-4 K=3 mixture exhibits a reproducible three-component shape across LOOO folds at the descriptor-position level, with C1 reproducibly located at $\overline{\text{cos}} \approx 0.946$, $\overline{\text{dHash}} \approx 9.17$ and weight $\approx 0.14$.
- Hard-posterior K=3 membership is composition-sensitive across folds (max absolute deviation $12.8$ pp, exceeding the report's $5$ pp viability bar); K=3 is therefore not used to assign operational labels to CPAs in v4.0.
The operational signature-level classifier of §III-L is calibrated against inter-CPA pair-level negative-anchor coincidence rates, not against mixture-derived antimodes. Cross-checks between the inherited five-way box rule and the K=3 partition appear in §III-K.
## K. Convergent Internal-Consistency Checks
The descriptive partition of §III-J is supported by three feature-derived per-CPA scores and a hard-ground-truth subset analysis. We caution at the outset that the three scores are **not statistically independent measurements** — all three are deterministic functions of the same per-CPA descriptor means $(\overline{\text{cos}}_a, \overline{\text{dHash}}_a)$ — so their high pairwise rank correlations are partly a mechanical consequence of shared inputs. Per §III-I.4, none of the three scores has a within-population bimodality interpretation; they are firm-compositional position scores at the accountant level. The checks below therefore document **internal consistency among feature-derived ranks**, not external validation against an independent hand-signed ground truth (which the corpus does not provide).
**1. Three feature-derived per-CPA scores (Script 38).** For each Big-4 CPA we compute:
- **Score 1 (K=3 posterior on the low-cos / high-dHash component):** $P(\text{C1})$ from the K=3 fit of §III-J. Per §III-J this is a firm-compositional position score on the (cos, dHash) plane (not a probability of any latent "hand-signing mechanism") — a function of both descriptor means.
- **Score 2 (reverse-anchor cosine percentile):** the marginal cosine CDF value of $\overline{\text{cos}}_a$ under the non-Big-4 reference Gaussian of §III-H, sign-flipped so that lower percentile (further into the reference's left tail) corresponds to a Big-4 CPA whose mean cosine sits further from the templated end. This is a function of $\overline{\text{cos}}_a$ alone.
- **Score 3 (inherited binary high-confidence box rule rate):** the per-CPA fraction of signatures that do **not** satisfy the inherited binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$). This is a per-signature-aggregated function of the same descriptors.
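Score 2 is the only one of the three scores computable from a single closed-form expression; a minimal sketch follows, assuming the §III-H reference is summarised by a mean and standard deviation (`ref_mu` and `ref_sigma` are placeholder names, not the paper's fitted values).

```python
from math import erf, sqrt

def reverse_anchor_score(mean_cos, ref_mu, ref_sigma):
    """Score 2 sketch: marginal CDF of a CPA's mean cosine under the
    non-Big-4 reference Gaussian, sign-flipped so that a mean cosine
    deeper in the reference's left tail yields a higher score.
    """
    # Standard normal CDF via erf: Phi(z) = (1 + erf(z / sqrt(2))) / 2.
    percentile = 0.5 * (1.0 + erf((mean_cos - ref_mu) / (ref_sigma * sqrt(2.0))))
    return -percentile  # sign flip: lower percentile -> higher score
```

The sign flip matches the convention in the per-firm summary below, where a higher (less negative) score indicates a position deeper in the reference's left tail.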
Pairwise Spearman rank correlations among the three scores, $n = 437$ Big-4 CPAs (Script 38):
| Score pair | Spearman $\rho$ | $p$ |
|---|---|---|
| Score 2 vs Score 3 | $+0.889$ | $< 10^{-149}$ |
| Score 1 vs Score 2 | $+0.879$ | $< 10^{-142}$ |
We read this as the strongest internal-consistency signal in v4.0: three different summarisations of the same descriptor pair agree on the per-CPA descriptor-position ranking with $\rho > 0.87$. The three scores agree on placing Firm A as the most replication-dominated descriptor position and the three non-Firm-A Big-4 firms further from the templated end, but they do not all rank the non-Firm-A firms identically: the K=3 posterior P(C1) and the box-rule less-replication-dominated rate (Scores 1 and 3) place Firm C at the less-replication-dominated end of Big-4 (mean P(C1) $= 0.311$; mean box-rule less-replication-dominated rate $= 0.790$), while the reverse-anchor cosine percentile (Score 2) places Firm D fractionally higher than Firm C (mean reverse-anchor score $-0.7125$ vs Firm C $-0.7672$, with higher value indicating deeper into the reference left tail). The mean values for Firms B and D sit between Firms A and C on Scores 1 and 3 (Script 38 per-firm summary). We do not claim this constitutes external validation of any operational classifier; the inherited box rule is calibrated separately (§III-L), and the convergence above shows that a mixture-derived score and a reverse-anchor score concur with the box rule's per-CPA-aggregated outputs on the directional ordering, with a modest disagreement at the less-replication-dominated end between the three non-A Big-4 firms.
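Spearman's $\rho$ as used above is the Pearson correlation of midranks; a dependency-free sketch follows (the scripts presumably use a library routine such as `scipy.stats.spearmanr`, which additionally supplies the p-values quoted in the table).

```python
def _ranks(xs):
    """1-based ranks with midrank handling of ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Because $\rho$ depends only on ranks, any monotone re-expression of a score (such as the sign flip in Score 2) leaves the reported correlations unchanged up to sign.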
**2. Per-signature consistency (Script 39).** Per-CPA aggregation could in principle reflect averaging across within-CPA heterogeneity rather than coherent within-CPA behaviour. We test this by repeating the K=3 fit at the signature level — fitting a fresh K=3 GMM to the 150,442 Big-4 signature-level $(\text{cos}, \text{dHash}_{\text{indep}})$ points (Script 39) — and comparing labels. The per-CPA and per-signature K=3 fits recover a broadly similar three-component ordering; per-CPA C1 is at $\overline{\text{cos}} = 0.946$, $\overline{\text{dHash}} = 9.17$ vs per-signature C1 at $\overline{\text{cos}} = 0.928$, $\overline{\text{dHash}} = 9.75$ (an absolute cosine drift of $0.018$). Cohen $\kappa$ on the binary collapse (replicated vs not-replicated):
The Script 39 report verdict is `SIG_CONVERGENCE_MODERATE`, with Cohen $\kappa = 0.870$ on the binary collapse.
- *K=2 LOOO is unstable.* The maximum absolute deviation of the four fold cosine crossings from their across-fold mean is $0.028$, against the report's $0.005$ across-fold stability tolerance (Script 36; pairwise fold range $0.0376$, from $0.9380$ to $0.9756$). When Firm A is held out, the fold rule classifies $171/171$ of held-out Firm A CPAs as templated; when any non-Firm-A Big-4 firm is held out, the fold rule classifies $0$ of the held-out firm's CPAs as templated. This pattern indicates the K=2 boundary is essentially a Firm-A-vs-others separator rather than a within-Big-4 mechanism boundary.
- *K=3 LOOO is partially stable.* The C1 (low-cos / high-dHash) component shape is reproducible across folds: max deviation from the full-Big-4 baseline is $0.005$ in cosine, $0.96$ in dHash, and $0.023$ in mixture weight (Script 37). Hard-posterior membership remains composition-sensitive — observed absolute differences are $1.8$–$12.8$ pp across the four folds, with the Firm C fold exceeding the report's $5$ pp viability bar; the report's own verdict is `P2_PARTIAL` ("K=3 is not predictively useful as an operational classifier"). We accordingly do not use K=3 hard-posterior membership as an operational label.
**4. Positive-anchor miss rate on byte-identical signatures (Script 40).** The corpus provides one hard ground-truth subset: signatures whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce pixel-identical images, so byte-identical signatures are conservative-subset ground truth for the *replicated* class. The Big-4 byte-identical subset comprises $n = 262$ signatures ($145 / 8 / 107 / 2$ across Firms A through D; Script 40).
We report each candidate classifier's *positive-anchor miss rate* — the fraction of byte-identical signatures classified as belonging to the less-replication-dominated descriptor positions. This is a one-sided check against a conservative positive subset, **not a paired specificity metric in the usual two-class sense**; we do not report a paired negative-anchor metric here because no signature-level hand-signed ground truth exists. The corresponding signature-level inter-CPA negative-anchor ICCR evidence is developed in §III-L.1 (Big-4 sample) and the v3.x §IV-I corpus-wide version (reported under prior "FAR" terminology):
| Candidate classifier | Positive-anchor miss rate (Wilson 95% CI) |
|---|---|
| Inherited Paper A binary high-confidence box rule (cos $> 0.95$ AND dHash $\leq 5$) | $0\%$ $[0\%, 1.45\%]$ |
| K=3 per-CPA hard label (C3 high-cos / low-dHash corner; descriptive only) | $0\%$ $[0\%, 1.45\%]$ |
| Reverse-anchor with prevalence-calibrated cut | $0\%$ $[0\%, 1.45\%]$ |
All three candidate scores correctly assign every byte-identical signature to the replicated class. We caution that for the inherited box rule this result is close to tautological: byte-identical nearest-neighbour signatures have cosine $\approx 1$ and dHash $\approx 0$ by construction, so any threshold strictly below cos $= 1$ and strictly above dHash $= 0$ will capture them; v3.x discussed this conservative-subset caveat at length (v3 §III-J item 1, V-F). The positive-anchor miss rate is therefore a necessary check (a classifier that *failed* this check would be disqualified), not a sufficient validation of the classifier's behaviour on the non-byte-identical replicated population. The reverse-anchor cut here is chosen by prevalence calibration against the inherited box rule's overall replicated rate ($49.58\%$ of Big-4 signatures; Script 40); this is a documented v4.0 limitation since no signature-level hand-signed ground truth exists to permit direct ROC optimisation.
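The $[0\%, 1.45\%]$ intervals in the table follow from the Wilson score construction evaluated at $0/262$; a minimal implementation:

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    Unlike the Wald interval, it remains non-degenerate at 0/n, which is
    why a miss count of zero still yields an informative upper bound.
    """
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (max(0.0, centre - half), min(1.0, centre + half))
```

At $0/262$ the upper bound evaluates to roughly $1.45\%$, matching the table.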
## L. Anchor-Based Threshold Calibration and Operational Classifier
§III-I.4 established that the descriptor distributions do not contain a within-population bimodal antimode that could anchor an operational threshold; the K=3 mixture of §III-J is a descriptive firm-compositional partition, not a mechanism-cluster model. This section develops v4.0's anchor-based threshold calibration: the operational thresholds inherited from v3.x are characterised by their inter-CPA pair-level negative-anchor coincidence behaviour and their pool-normalised per-signature and per-document alert behaviour, at multiple units of analysis. Throughout this section we report **inter-CPA coincidence rates** rather than "False Acceptance Rates"; we explain the terminological choice in §III-L.0.
### L.0. Calibration methodology
**Operational classifier (inherited from v3.20.0 §III-K, retained unchanged).** Each Big-4 signature is assigned to one of five categories using the per-signature descriptor pair $(\text{cos}_s, \text{dHash}_s)$ where $\text{cos}_s$ is the maximum cosine similarity to another signature by the same CPA and $\text{dHash}_s$ is the minimum independent dHash to another signature by the same CPA:
1. **High-confidence non-hand-signed (HC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} \leq 5$. Both descriptors converge on strong replication evidence.
2. **Moderate-confidence non-hand-signed (MC):** Cosine $> 0.95$ AND $5 < \text{dHash}_{\text{indep}} \leq 15$. Feature-level evidence is strong; structural similarity is present but below the high-confidence cutoff.
3. **High style consistency (HSC):** Cosine $> 0.95$ AND $\text{dHash}_{\text{indep}} > 15$. High feature-level similarity without structural corroboration — consistent with a CPA who signs very consistently but not via image reproduction.
4. **Uncertain (UN):** Cosine between the all-pairs intra/inter KDE crossover ($0.837$) and $0.95$.
5. **Likely hand-signed (LH):** Cosine $\leq 0.837$.
The thresholds ($\text{cos} = 0.95$ as the cosine operating point, $\text{cos} = 0.837$ as the all-pairs KDE crossover, $\text{dHash} = 5$ and $15$ as structural-similarity sub-band cutoffs) are inherited from v3.x §III-K and retain their v3.x calibration provenance. Document-level labels are aggregated via the v3.x worst-case rule: each audit report inherits the most-replication-consistent category among its certifying-CPA signatures (rank order HC > MC > HSC > UN > LH).
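The worst-case aggregation is a one-line reduction over per-signature categories; a sketch using the HC > MC > HSC > UN > LH rank order (the dictionary encoding is illustrative):

```python
# Lower rank value = more replication-consistent category.
RANK = {"HC": 0, "MC": 1, "HSC": 2, "UN": 3, "LH": 4}

def document_label(signature_labels):
    """Worst-case rule: the report inherits its most-replication-consistent
    signature category."""
    return min(signature_labels, key=RANK.__getitem__)
```

A report with one LH and one HC signature is therefore labelled HC, reflecting the detection goal of flagging any potentially non-hand-signed report.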
**Why retained without v4.0 recalibration.** The inherited thresholds preserve continuity with v3.x reporting and with the existing literature. §III-I.4 establishes that a v4.0 recalibration cannot be anchored on distributional antimodes (no within-population bimodality exists); §III-L.1 confirms that the cosine threshold's specificity behaviour at the inter-CPA pair level (the v3.x calibration anchor) is reproducible on the v4 spike sample, and §III-L.1 newly characterises the structural-dimension threshold $\text{dHash} \leq 5$'s pair-level coincidence behaviour. Sub-band thresholds ($\text{dHash} = 15$, $\text{cos} = 0.837$) retain v3.x's inherited calibration; v4.0 does not provide independent calibration for those sub-bands.
**Three units of analysis.** We report inter-CPA negative-anchor coincidence behaviour at three units, each addressing a different operational question:
- *Per comparison.* For a randomly drawn pair of signatures from different CPAs, what fraction satisfies the rule (cos $>$ cos\_threshold and / or dHash $\leq$ dHash\_threshold)? This is the unit at which v3.x §IV-I characterised the cosine threshold's specificity behaviour and at which threshold-derivation in biometric verification is conventionally calibrated. We report it for both the cosine and dHash dimensions, marginally and jointly (§III-L.1).
|
||||
- *Per signature pool.* For a Big-4 source signature $s$ with same-CPA pool of size $n_{\text{pool}}(s)$, what is the probability that the deployed rule fires *under the counterfactual* of replacing the source's same-CPA pool with $n_{\text{pool}}(s)$ random non-same-CPA candidates? This addresses the standard concern that a per-pair rate computed on independent pairs is not the deployed-rule rate at the per-signature classifier level: the deployed rule takes max-cosine and min-dHash over a pool of size $n_{\text{pool}}(s)$, so its effective coincidence rate is approximately $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit (§III-L.2).
|
||||
- *Per document.* For an audit report aggregated via the worst-case rule, what fraction of documents have at least one signature whose deployed pool-normalised rule fires under the same inter-CPA candidate-replacement counterfactual? This is the operational alarm-rate unit (§III-L.3).
|
||||
|
||||
**Document-level aggregation.** Each audit report typically carries two certifying-CPA signatures. We aggregate signature-level outcomes to document-level labels using the v3.x worst-case rule: the document inherits the *most-replication-consistent* signature label among the two signatures (rank order: High-confidence $>$ Moderate-confidence $>$ Style-consistency $>$ Uncertain $>$ Likely-hand-signed). The aggregation rule reflects the detection goal of flagging any potentially non-hand-signed report.
|
||||
**Any-pair vs same-pair semantics.** The deployed rule uses independent extrema: a signature satisfies the HC rule if $\max_{\text{pool}} \text{cos} > 0.95$ AND $\min_{\text{pool}} \text{dHash} \leq 5$, *not* if a single candidate in the pool satisfies both. We refer to this as the **any-pair** rule. A stricter alternative — the **same-pair** rule — requires a single candidate to satisfy both inequalities; the deployed v3/v4 rule is any-pair, but we report same-pair as a stricter alternative classifier where useful (§III-L.2, §III-L.4).
|
||||
|
||||
**K=3 as accountant-level characterisation, not classifier.** The K=3 mixture of §III-J is reported in §IV as an accountant-level descriptive summary alongside the per-signature five-way classifier. We do not assign signature-level or document-level labels from the K=3 mixture in any v4.0 result table; the K=3 hard label is used for the accountant-level cluster cross-tabulation (Script 35), and the K=3 *posterior* P(C1) is used (as the continuous Score 1) in the internal-consistency Spearman correlations of §III-K.
|
||||
**Terminological note on "FAR".** The v3.x and biometric-verification literature speak of "False Acceptance Rate" (FAR) for a per-pair rate computed on independent inter-CPA pairs. We adopt **inter-CPA coincidence rate (ICCR)** as the v4.0 metric name and *do not* use "FAR" in the manuscript prose, for two reasons: (a) FAR has a specific biometric-verification meaning that requires ground-truth negative labels (which the corpus does not provide at the signature level); (b) §III-L.4 shows that the inter-CPA negative-anchor assumption — that inter-CPA pairs are negative — is partially violated by within-firm cross-CPA template-like collision structures. Reading "inter-CPA coincidence rate" as a *specificity proxy* under an explicitly disclosed assumption is faithful to the evidence; reading it as a true biometric FAR would overstate the evidence. We retain the v3.x numerical results (which are quantitatively reproduced in §III-L.1) under the new terminology.
|
||||
|
||||
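The rule semantics and the worst-case aggregation above can be made concrete with a short sketch. The candidate pools, (cosine, dHash) pairs, and label lists below are hypothetical; only the thresholds and the rank order are the inherited conventions:

```python
# Inherited HC cutoffs (cos > 0.95 AND dHash <= 5).
COS_T, DHASH_T = 0.95, 5

def any_pair_hc(pool):
    """Deployed rule: independent extrema over the same-CPA pool."""
    return max(c for c, _ in pool) > COS_T and min(d for _, d in pool) <= DHASH_T

def same_pair_hc(pool):
    """Stricter alternative: a single candidate must satisfy both cutoffs."""
    return any(c > COS_T and d <= DHASH_T for c, d in pool)

# Worst-case document aggregation: most-replication-consistent label wins.
RANK = {"HC": 0, "MC": 1, "HSC": 2, "UN": 3, "LH": 4}

def doc_label(signature_labels):
    return min(signature_labels, key=RANK.__getitem__)

# High cosine and low dHash occur on *different* candidates:
pool = [(0.97, 9), (0.80, 4)]
print(any_pair_hc(pool))        # True  (extrema taken independently)
print(same_pair_hc(pool))       # False (no single candidate satisfies both)
print(doc_label(["UN", "MC"]))  # MC
```

The toy pool illustrates exactly why the any-pair rule fires more often than the same-pair rule: it credits the best cosine and the best dHash even when they come from different candidates.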
### L.1. Per-comparison inter-CPA coincidence rate (Script 40b)

We sample $5 \times 10^5$ inter-CPA pairs uniformly at random from Big-4 signatures, computing for each pair the cosine similarity (feature dot product) and the Hamming distance between the dHash byte vectors. Marginal and joint rates at threshold $k$ are reported with Wilson 95% confidence intervals (Script 40b).

| Threshold | Per-comparison inter-CPA coincidence rate | 95% Wilson CI |
|---|---|---|
| Cosine $> 0.95$ | $0.00060$ | $[0.00053, 0.00067]$ |
| Cosine $> 0.945$ (v3.x published "natural threshold") | $0.00081$ | $[0.00073, 0.00089]$ |
| Cosine $> 0.97$ | $0.00024$ | $[0.00020, 0.00029]$ |
| Cosine $> 0.98$ | $0.00009$ | $[0.00007, 0.00012]$ |
| dHash $\leq 5$ | $0.00129$ | $[0.00120, 0.00140]$ |
| dHash $\leq 4$ | $0.00050$ | $[0.00044, 0.00057]$ |
| dHash $\leq 3$ | $0.00019$ | $[0.00015, 0.00023]$ |
| dHash $\leq 2$ | $0.00006$ | $[0.00004, 0.00008]$ |
| Joint: cos $> 0.95$ AND dHash $\leq 5$ | $0.00014$ | (any-pair semantics) |
| Joint: cos $> 0.95$ AND dHash $\leq 4$ | $0.00011$ | (any-pair semantics) |

The cosine row at $\text{cos} > 0.95$ replicates the v3.x §IV-I Table X result (v3.x reported the per-comparison rate as $0.0005$, under the prior "FAR" terminology, from a similarly sized inter-CPA negative anchor; the v4 spike on a $5 \times 10^5$-pair sample yields $0.00060$, within the v3.x reported precision). The dHash and joint rows are v4-new: v3.x calibration did not provide an inter-CPA pair-level coincidence rate for the structural dimension or the joint rule.

The all-firms-scope sample yields slightly lower per-comparison coincidence rates (cos $> 0.95$: $0.00031$; dHash $\leq 5$: $0.00073$; joint: $0.00007$); the all-firms sample weights small CPAs more heavily under CPA-uniform pair sampling, so we treat the Big-4 sample as the primary calibration scope and report all-firms as a corroborating-scope robustness check.

**Conditional inter-CPA coincidence rate.** A natural follow-up question is whether the dHash dimension provides marginal specificity beyond the cosine gate. For pairs with cos $> 0.95$, the conditional rate of dHash $\leq 5$ is $0.234$ (Wilson 95% CI $[0.190, 0.285]$; $70$ of $299$ pairs in the Big-4 sample). At cos $> 0.95$, dHash provides $\sim 4.3\times$ further per-comparison specificity (joint $0.00014$ vs cos-only $0.00060$).

The per-comparison rate is a useful *specificity-proxy calibration* for the deployed rule's pair-level behaviour. It does *not* directly translate to the deployed-rule specificity at the per-signature classifier level, because the deployed classifier takes extrema over a same-CPA pool of size $n_{\text{pool}}$. The pool-normalised inter-CPA alert rate is reported in §III-L.2.
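The Wilson intervals quoted throughout can be reproduced from the standard score-interval formula. As a check, a minimal sketch recovering the conditional-rate interval for $70$ of $299$ pairs:

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n at ~95% coverage."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# 70 of 299 cos > 0.95 pairs satisfied dHash <= 5 in the Big-4 sample.
lo, hi = wilson_ci(70, 299)
print(f"{lo:.3f}, {hi:.3f}")  # 0.190, 0.285
```

The same function, applied to counts at each threshold row, reproduces the marginal-rate intervals in the table above.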
### L.2. Pool-normalised inter-CPA alert rate (Script 43)

The deployed rule uses $\max_{\text{pool}} \text{cos}$ and $\min_{\text{pool}} \text{dHash}$ over the same-CPA pool of size $n_{\text{pool}}(s)$ for each signature $s$. A per-comparison rate is therefore not the rate at which the deployed classifier fires per signature. To compute the per-signature inter-CPA-equivalent rate, for each Big-4 source signature $s$ we simulate one realisation of an inter-CPA candidate pool of the same size $n_{\text{pool}}(s)$, drawn uniformly from non-same-CPA signatures across all firms, compute the deployed extrema and rule indicator, and aggregate (Script 43; $n_{\text{sig}} = 150{,}453$ vector-complete in this analysis; CPA-block bootstrap 95% CIs reported below).

**Headline rates (deployed any-pair rule, HC = cos $> 0.95$ AND dHash $\leq 5$).** Wilson 95% CIs on the point estimates; CPA-block bootstrap 95% CIs from $n_{\text{boot}} = 1000$ replicates:

| Rule semantics | Per-signature ICCR | Wilson 95% CI | CPA-bootstrap 95% CI |
|---|---|---|---|
| Any-pair (deployed) | $0.1102$ | $[0.1086, 0.1118]$ | $[0.0908, 0.1330]$ |
| Same-pair (stricter alternative) | $0.0827$ | $[0.0813, 0.0841]$ | $[0.0668, 0.1021]$ |

Per-firm any-pair and same-pair rates (no bootstrap; descriptive):

| Firm | $n_{\text{sig}}$ | Any-pair ICCR | Same-pair ICCR |
|---|---|---|---|
| Firm A | $60{,}450$ | $0.2594$ | $0.2018$ |
| Firm B | $34{,}254$ | $0.0147$ | $0.0023$ |
| Firm C | $38{,}616$ | $0.0053$ | $0.0019$ |
| Firm D | $17{,}133$ | $0.0110$ | $0.0051$ |

**Pool-size decile dependence.** The deployed rule's pool-normalised rate increases broadly monotonically in $n_{\text{pool}}$, consistent with the $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ form expected under inter-CPA independence (Script 43 decile table). Decile 1 (smallest pools, $n_{\text{pool}} \in [0, 201]$): any-pair ICCR $= 0.0249$. Decile 10 (largest, $n_{\text{pool}} \in [846, 1115]$): any-pair ICCR $= 0.1905$. The trend shows two minor non-monotone reversals (deciles 5 and 9 dip below their predecessors).

**Threshold sensitivity at per-signature unit.** Tightening the HC rule from $\text{dHash} \leq 5$ to $\text{dHash} \leq 3$ (same-pair) reduces the per-signature ICCR from $0.0827$ to $0.0449$ (Big-4 pooled); tightening to $\text{dHash} \leq 4$ gives $0.0639$ (same-pair). Stakeholders requiring a per-signature ICCR of $\leq 0.05$ at HC can adopt dHash $\leq 3$ same-pair as a stricter operating point; at $\leq 0.10$ the inherited HC any-pair rule with $\text{dHash} \leq 5$ at $0.1102$ is within tolerance.
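The independence-limit form $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ can be sketched directly. The per-pair rate and pool sizes below are taken from §III-L.1 and the decile table above; the computed values are the independence-limit approximation only, and observed pool-normalised rates can exceed them where the independence assumption fails:

```python
def pool_rate(p_pair, n_pool):
    """Independence-limit per-signature rate for extrema over a pool:
    probability that at least one of n_pool independent candidates fires."""
    return 1 - (1 - p_pair) ** n_pool

# Per-comparison joint ICCR 0.00014 (cos > 0.95 AND dHash <= 5) at the
# decile-1 and decile-10 upper pool-size bounds.
print(f"{pool_rate(0.00014, 201):.4f}")   # 0.0277
print(f"{pool_rate(0.00014, 1115):.4f}")  # 0.1445
```

The decile-10 observed any-pair ICCR of $0.1905$ sits above the independence-limit value at the same pool size, consistent with the within-firm collision structure documented in §III-L.4.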
### L.3. Document-level inter-CPA proxy alert rate (Script 45)

The deployed worst-case aggregation classifies each document by the most-replication-consistent category among its constituent signatures (§III-L.0). Three operationally meaningful document-level alarm definitions are reported, each as the fraction of documents whose worst-case signature category falls in the alarm set under the same inter-CPA candidate-pool counterfactual as §III-L.2 (Script 45; $n_{\text{docs}} = 75{,}233$ Big-4 documents):

| Alarm definition | Alarm set | Document-level ICCR | Wilson 95% CI |
|---|---|---|---|
| D1 | HC only | $0.1797$ | $[0.1770, 0.1825]$ |
| D2 | HC + MC ("any non-hand-signed verdict") | $0.3375$ | $[0.3342, 0.3409]$ |
| D3 | HC + MC + HSC | $0.3384$ | $[0.3351, 0.3418]$ |

Per-firm D2 document-level rates:

| Firm | $n_{\text{docs}}$ | D2 (HC + MC) ICCR |
|---|---|---|
| Firm A | $30{,}226$ | $0.6201$ |
| Firm B | $17{,}127$ | $0.1600$ |
| Firm C | $19{,}501$ | $0.1635$ |
| Firm D | $8{,}379$ | $0.0863$ |

The document-level D2 rate of $33.75\%$ pooled over Big-4 is the most operationally relevant alarm-rate metric: it is the fraction of audit documents that would carry at least one signature flagged HC or MC under the counterfactual of inter-CPA candidate-pool replacement. The non-trivial per-document inter-CPA alarm rate (and its concentration in Firm A at $62\%$) motivates the positioning of the operational system as a **screening framework with human-in-the-loop review**, not as an autonomous forensic classifier (§III-M).
### L.4. Firm heterogeneity (Script 44)

§III-L.2 and §III-L.3 report large per-firm variation in the deployed rule's pool-normalised behaviour: Firm A's any-pair per-signature ICCR is $0.2594$, more than an order of magnitude larger than Firm B's $0.0147$, Firm C's $0.0053$, and Firm D's $0.0110$. A natural alternative explanation is the pool-size confound: Firm A's median pool size ($\sim 285$) is larger than the other firms', and pool size broadly monotonically increases the per-signature rate (§III-L.2 decile trend). We test the firm-vs-pool confound with a logistic regression of the per-signature hit indicator (any-pair HC) on firm dummies (Firm A = reference) and centred log pool size (Script 44):

| Term | Odds ratio (vs Firm A) | Direction | Magnitude |
|---|---|---|---|
| Firm B | $0.053$ | $< 1$ | $\sim 19\times$ lower odds than Firm A |
| Firm C | $0.010$ | $< 1$ | $\sim 100\times$ lower odds than Firm A |
| Firm D | $0.027$ | $< 1$ | $\sim 37\times$ lower odds than Firm A |
| log(pool size, centred) | $4.01$ | $> 1$ | $\sim 4\times$ higher odds per unit log pool size |

The Firm B/C/D odds ratios are very small after controlling for pool size, indicating that firm membership accounts for a large multiplicative effect on the per-signature rate that is *not* explained by pool size alone. (We report odds ratios rather than $z$-scores because per-signature observations are clustered by CPA and firm, so naïve standard errors would understate the uncertainty and inflate $z$-scores; a cluster-robust standard error analysis is left as a robustness check.)

The per-decile per-firm breakdown (Script 44) confirms the pattern: within every pool-size decile, Firms B/C/D have rates of $0.0006$–$0.0358$, while Firm A's rate ranges $0.0541$–$0.5958$ across deciles. The firm gap is large within matched pool sizes, not driven by pool composition.

**Cross-firm hit matrix.** Among Big-4 source signatures whose any-pair rule fires under the inter-CPA candidate-pool counterfactual, the candidate firm of the max-cosine partner is distributed as follows (Script 44):

| Source firm | Firm A candidate | Firm B | Firm C | Firm D | non-Big-4 | hits |
|---|---|---|---|---|---|---|
| Firm A | $14{,}447$ | $95$ | $44$ | $19$ | $17$ | $14{,}622$ |
| Firm B | $92$ | $371$ | $8$ | $4$ | $9$ | $484$ |
| Firm C | $16$ | $7$ | $149$ | $5$ | $1$ | $178$ |
| Firm D | $22$ | $2$ | $6$ | $106$ | $1$ | $137$ |

For the same-pair joint event (a single candidate satisfying both $\text{cos} > 0.95$ and $\text{dHash} \leq 5$), the candidate firm is even more strongly concentrated within the source firm: Firm A source $\to$ Firm A candidate in $11{,}314$ of $11{,}319$ same-pair hits ($99.96\%$); Firm B source $\to$ Firm B candidate in $85$ of $87$ ($97.7\%$); Firm C source $\to$ Firm C candidate in $54$ of $55$ ($98.2\%$); Firm D source $\to$ Firm D candidate in $64$ of $66$ ($97.0\%$).

**Interpretation.** The cross-firm hit matrix shows that nearly all inter-CPA collisions under the deployed rule originate from candidates within the source firm (different CPA, same firm). This pattern is consistent with — but not by itself diagnostic of — firm-specific template, stamp, or document-production reuse: within-firm scanning workflows, common form templates, and shared report-generation infrastructure could produce visually similar signature crops across different CPAs within the same firm. The byte-level evidence of v3.x §IV-F.1 (Firm A's $145$ pixel-identical signatures across $\sim 50$ distinct certifying partners) provides direct evidence that firm-level template reuse does occur at Firm A; the broader inter-CPA collision pattern in §III-L.4 is consistent with that mechanism extending in milder form to Firms B/C/D. We report this as "inter-CPA collision concentration is within-firm" — a descriptive observation about deployed-rule behaviour — and refrain from inferring that the within-firm hits constitute deliberate or systematic template sharing.

This connects back to §III-J: the K=3 firm-composition contrast at the accountant level (Firm A dominating C3; Firm C dominating C1) reappears at the deployment level in the cross-firm hit matrix, where nearly all collisions are within-firm. The K=3 partition and the cross-firm hit matrix describe the same underlying firm-compositional structure at two different units of analysis.
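A minimal, self-contained version of the firm-vs-pool logistic specification can be sketched on synthetic data. The data-generating coefficients below are hypothetical, chosen only to mimic the qualitative Script 44 pattern (a firm odds ratio below 1 and a pool-size odds ratio above 1); the fit uses plain full-batch gradient ascent rather than a statistics package:

```python
import math
import random

random.seed(0)

# Hypothetical per-signature substrate: hit ~ firm dummy (vs reference
# firm) + centred log pool size, mimicking the Script 44 specification.
n = 1200
X, y = [], []
for _ in range(n):
    firm_b = 1.0 if random.random() < 0.5 else 0.0
    logn = random.gauss(0.0, 1.0)               # centred log pool size
    eta = -1.0 - 2.0 * firm_b + 1.0 * logn      # hypothetical true log-odds
    X.append((1.0, firm_b, logn))
    y.append(1 if random.random() < 1 / (1 + math.exp(-eta)) else 0)

# Full-batch gradient ascent on the logistic log-likelihood.
w = [0.0, 0.0, 0.0]
for _ in range(600):
    g = [0.0, 0.0, 0.0]
    for xi, yi in zip(X, y):
        p = 1 / (1 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
        for j in range(3):
            g[j] += (yi - p) * xi[j]
    w = [wj + gj / n for wj, gj in zip(w, g)]

print(f"firm odds ratio ~ {math.exp(w[1]):.2f}")  # well below 1
print(f"pool odds ratio ~ {math.exp(w[2]):.2f}")  # above 1
```

Exponentiating the fitted coefficients gives the odds ratios reported in the table's second column; on this synthetic substrate they recover the generating pattern (firm effect suppressive, pool-size effect amplifying).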
### L.5. Alert-rate sensitivity around inherited thresholds (Script 46)

To test whether the inherited cosine threshold $0.95$ and dHash threshold $5$ coincide with a low-gradient (plateau-stable) region of the deployed-rule alert-rate surface — which would be weak distributional evidence that the inherited thresholds are stable operating points — we sweep each threshold across a range and report the per-signature alert rate on actual observed Big-4 same-CPA pools (not inter-CPA-replaced pools), comparing the local gradient at the inherited threshold to the median gradient across the sweep (Script 46).

At the inherited HC operating point cos $> 0.95$ AND dHash $\leq 5$, the local gradient of the per-signature alert rate is substantially larger than the median gradient across the sweep (cosine: ratio $\approx 25\times$ at the $0.95$ point; dHash: ratio $\approx 3.8\times$ at the $5$ point; both Script 46). Read descriptively, the inherited HC threshold is *locally sensitive* rather than plateau-stable: small threshold perturbations materially change the deployed alert rate. The cosine sweep at dHash $\leq 5$ yields rates of $0.5091$ at cos $> 0.945$ vs $0.4789$ at cos $> 0.955$, a $3.0$ pp swing across a $0.01$ cosine perturbation; the dHash sweep at cos $> 0.95$ yields $0.4207$ at dHash $\leq 4$ vs $0.5639$ at dHash $\leq 6$, a $14.3$ pp swing across a single integer step. The local-gradient-to-median-gradient ratios are descriptive diagnostics, not formal plateau tests; the primary evidence for "no within-population bimodal antimode at these thresholds" comes from §III-I.4's composition decomposition, not from §III-L.5.

The MC/HSC boundary at dHash $= 15$, by contrast, *is* in a low-gradient region (ratio $\approx 0.08$ to the median); the plateau-like behaviour around dHash $= 15$ is corroborating evidence that the high-end structural threshold lies in a regime where the rule's alert rate is approximately saturated, consistent with the high-dHash tail behaviour expected once near-identical pairs have been exhausted. The §III-L.5 non-plateau / local-sensitivity finding therefore applies specifically to the HC cutoff (cos $= 0.95$, dHash $= 5$); the MC/HSC sub-band boundary at dHash $= 15$ exhibits the opposite behaviour and is plateau-like.

We interpret the inherited HC thresholds as **specificity-anchored operating points** chosen for the specificity-vs-alert-yield tradeoff (§III-L.1), *not* as distributional antimodes. Stakeholders requiring different operating points on the tradeoff curve can derive thresholds by inverting the per-comparison or pool-normalised ICCR curves (§III-L.1, §III-L.2) at their preferred specificity target.
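The local-gradient diagnostic can be sketched as a finite-difference comparison against the median gradient over the sweep. The grid and alert rates below are illustrative stand-ins, not Script 46 output:

```python
import statistics

def gradient_ratio(thresholds, rates, at):
    """Ratio of the finite-difference gradient at `at` to the median
    gradient across the whole sweep (descriptive diagnostic only)."""
    grads = [abs(r2 - r1) / (t2 - t1)
             for (t1, r1), (t2, r2) in zip(zip(thresholds, rates),
                                           zip(thresholds[1:], rates[1:]))]
    i = thresholds.index(at)
    local = grads[min(i, len(grads) - 1)]  # forward difference at `at`
    return local / statistics.median(grads)

# Hypothetical cosine sweep at dHash <= 5 with a sharp drop near 0.95.
ts = [0.90, 0.92, 0.94, 0.945, 0.95, 0.955, 0.96, 0.98]
rs = [0.52, 0.515, 0.512, 0.509, 0.479, 0.47, 0.468, 0.465]
print(round(gradient_ratio(ts, rs, 0.945), 1))  # 15.0
```

A ratio well above 1 flags the operating point as locally sensitive; a ratio well below 1 (as at the dHash $= 15$ boundary) indicates plateau-like behaviour.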
### L.6. Observed deployed alert rate on actual same-CPA pools

The pool-normalised inter-CPA rates of §III-L.2 and §III-L.3 use the counterfactual of replacing the source signature's same-CPA pool with random non-same-CPA candidates. The **observed deployed alert rate** uses the source's actual same-CPA pool, i.e., the rate at which the deployed rule fires on the real corpus. For Big-4, the inherited HC any-pair rule fires on $49.58\%$ of signatures and $62.28\%$ of documents (Script 46; Script 42 reproduces the per-signature rate at $49.58\%$).

The per-signature observed-deployed rate is $\sim 4.5\times$ the pool-normalised inter-CPA rate ($0.4958$ vs $0.1102$); the per-document observed-deployed rate is $\sim 3.5\times$ the pool-normalised inter-CPA D1 (HC) rate ($0.6228$ vs $0.1797$). We refer to this multiplicative gap as the **deployed-rate excess over the inter-CPA proxy**:

- Per-signature: $0.4958 - 0.1102 = 0.3856$ ($38.6$ pp excess)

- Per-document HC: $0.6228 - 0.1797 = 0.4431$ ($44.3$ pp excess)

We *do not* interpret the deployed-rate excess as a presumed true-positive rate; the inferential limits of this interpretation are developed in §III-M. The deployed-rate excess is best read as a *same-CPA repeatability signal* — a quantity that exceeds what random inter-CPA candidate replacement would produce — rather than as an estimate of true replication prevalence.
### L.7. K=3 not used as classifier

The K=3 mixture of §III-J is reported in §IV as an accountant-level descriptive summary alongside the per-signature five-way classifier. We do not assign signature-level or document-level labels from the K=3 mixture in any v4.0 result table; the K=3 hard label is used only for the accountant-level firm × cluster cross-tabulation (§III-J; Script 35), and the K=3 *posterior* P(C1) is used (as the continuous Score 1) in the internal-consistency Spearman correlations of §III-K. The operational classifier of §III-L.0 is the inherited v3.x five-way box rule; the calibration evidence in §III-L.1 through §III-L.6 characterises its multi-level coincidence behaviour against the inter-CPA negative anchor.
## M. Validation Strategy and Limitations under Unsupervised Setting

The v4.0 corpus lacks signature-level ground-truth replication labels: no signature is annotated as definitively hand-signed or definitively templated. The conservative positive anchor (pixel-identical same-CPA signatures; §III-K.4 and v3.x §IV-F.1) is by construction near $\text{cos} = 1$ and $\text{dHash} = 0$, providing a tautological capture-check rather than a sensitivity estimate for the non-byte-identical replicated class. The corpus therefore does not admit standard supervised classifier validation: we cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, or precision against ground truth.

In place of supervised validation, v4.0 adopts a **multi-tool collection of partial-evidence diagnostics**, each with an explicitly disclosed assumption:

| Tool | What it measures | Untested assumption |
|---|---|---|
| Per-comparison inter-CPA coincidence rate (§III-L.1; Script 40b) | Pair-level specificity proxy under a random-pair negative anchor | Inter-CPA pairs are negative (i.e., not template-related); partially violated by within-firm sharing (§III-L.4) |
| Pool-normalised per-signature ICCR (§III-L.2; Script 43) | Deployed-rule specificity proxy at per-signature unit, accounting for pool size | Same as above + that pool replacement preserves the negative-anchor property |
| Document-level ICCR (§III-L.3; Script 45) | Operational alarm rate proxy at per-document unit under three alarm definitions | Same as above |
| Firm-heterogeneity logistic regression (§III-L.4; Script 44) | Multiplicative effect of firm membership on per-signature rate, controlling for pool size | Per-signature observations are clustered by CPA/firm; naïve standard errors understate uncertainty; cluster-robust analysis is a future check |
| Cross-firm hit matrix (§III-L.4; Script 44) | Concentration of inter-CPA collisions within source firm | None — direct descriptive observation |
| Alert-rate sensitivity sweep (§III-L.5; Script 46) | Local sensitivity of deployed rule to threshold perturbation | Gradient comparison is descriptive, not a formal plateau test |
| Convergent score Spearman ranking (§III-K.1; Script 38) | Internal consistency of three feature-derived per-CPA scores | Scores share underlying inputs and are not statistically independent |
| Pixel-identical conservative positive capture (§III-K.4; v3.x; Script 40) | Trivial sanity check on the conservative positive anchor | Anchor is tautologically captured by any reasonable threshold |
| LOOO firm-level reproducibility (§III-K.3; Scripts 36, 37) | Algorithmic stability of K=2 / K=3 partition across firm folds | Stability is necessary but not sufficient for classification validity |

No single tool in this collection provides ground-truth validation. Their conjunction constitutes the unsupervised validation ceiling that the v4.0 corpus admits.

**What v4.0 does not claim.** We do not claim a validated forensic detector or an autonomous classification system. We do not report False Rejection Rate, sensitivity, recall, EER, ROC-AUC, precision, or positive predictive value against ground truth, because no ground truth exists at the signature level. We do not interpret the deployed-rate excess of §III-L.6 as a presumed true-positive rate: that interpretation would require assuming that the within-firm same-CPA pool's collision rate equals the inter-CPA proxy rate in the absence of replication (i.e., that genuine same-CPA hand-signing would produce a collision rate no higher than random inter-CPA pairs). Two factors make the assumption unsafe: (a) a CPA who signs consistently can produce stylistically similar signatures across years that exceed inter-CPA similarity at the cosine axis; (b) within-firm template sharing (§III-L.4 cross-firm hit matrix; v3.x byte-level evidence of Firm A's pixel-identical signatures across partners) places a substantial inter-CPA collision floor that itself reflects template-like reuse rather than independent inter-CPA random matching. We do not infer that the within-firm collision concentration of §III-L.4 constitutes deliberate template sharing; we describe it as "inter-CPA collision concentration is within-firm" and treat the mechanism as an open empirical question.

**What v4.0 does claim.** The deployed signature-replication screening rule is characterised at three units of analysis (per-comparison, per-signature pool, per-document) against an inter-CPA negative-anchor coincidence-rate calibration. The per-comparison rates ($\leq 0.0006$ at cos $> 0.95$; $\leq 0.0013$ at dHash $\leq 5$; $\leq 0.00014$ jointly) are specificity-proxy-anchored operating points consistent with biometric-verification convention, with the proxy nature recorded in §III-L.0 and §III-M. The per-signature and per-document rates ($0.11$ and $0.34$ respectively under the deployed any-pair HC + MC alarm) are operationally meaningful **alarm-yield** indicators rather than true error rates. Per-firm rates show substantial heterogeneity (Firm A's per-document HC + MC alarm at $0.62$ vs Firm B/C/D at $0.09$–$0.16$), driven by firm-level rather than pool-size effects, and concentrated in within-firm cross-CPA candidate matching. The framework is positioned as a **specificity-proxy-anchored screening tool with human-in-the-loop review**, not as a validated forensic classifier.

**Specificity-alert-yield tradeoff.** Because sensitivity is unobservable, stakeholders cannot derive an operating point by optimising a ROC criterion. Instead, the specificity-proxy-anchored framework offers a *specificity-alert-yield tradeoff*: tighter operating points (e.g., cos $> 0.98$ AND dHash $\leq 3$) reduce both per-comparison ICCR (to $\approx 5 \times 10^{-5}$; §III-L.1 inversion) and per-signature alert yield (to $\approx 0.05$; §III-L.2), with an unknown effect on actual replication-detection recall. Tighter operating points are not necessarily preferable: any tightening reduces the alert rate but may also miss true replicated signatures whose noise has pushed them outside the tighter envelope. The deployment decision depends on the relative cost of manual review (per alarm) and missed-replication risk (per false negative) — neither directly observable from corpus data.

---
## Provenance table for key numerical claims in §III-G through §III-L

The table below lists the principal numerical claims and their data-source scripts. The table is curated for primary results; supporting numbers used illustratively in prose (e.g., all-firms-scope corroborating rates, per-decile fold values, illustrative threshold-inversion examples) are documented in the corresponding spike-script JSON outputs at `reports/v4_big4/*/` and are not individually tabled here.

| Claim | Value | Source | Notes |
|---|---|---|---|
| Big-4 CPA count, $n_{\text{sig}} \geq 10$ | $437$ ($171/112/102/52$) | Script 36 sample sizes; Script 38 per-firm summary | direct |
| Big-4 signature count (descriptor-complete) | $150{,}442$ | Script 39 / 39b / 39d / 39e / 45 / 46 | analyses using pre-computed descriptors; the per-signature K=3 fit (Script 39) cites this $n$ |
| Big-4 signature count (vector-complete) | $150{,}453$ | Script 40b / 43 / 44 | analyses recomputing from feature + dHash vectors |
| Non-Big-4 reference CPA count | $249$ | Script 38 reference population | direct |
| Big-4 K=2 marginal crossings | $(0.9755, 3.755)$ | Script 34; Script 36 §A | direct |
| Bootstrap 95% CI, cosine crossing | $[0.9742, 0.9772]$ | Script 34; Script 36 §A | $n_{\text{boot}} = 500$ |
| Pixel-identity Big-4 subset | $n = 262$ ($145/8/107/2$) | Script 40 sample | direct |
| Full-dataset accountant count | $n = 686$ | Script 41 (`fulldataset_report.md`) | direct |
| Positive-anchor miss rate on $n = 262$ | $0\%$ (Wilson upper $1.45\%$) | Script 40 results table | direct |
| Inter-CPA cos $> 0.95$ ICCR | $0.0005$ (Wilson 95% $[0.0003, 0.0007]$) | v3 §IV-F.1 / Table X | inherited from v3; reported there as "FAR", reframed as inter-CPA coincidence rate per §III-L.0 |
| Firm A pixel-identical signatures in Big-4 subset | $145$ | Script 40 sample breakdown | direct |
| Firm A byte-identical span | $50$ distinct partners of $180$; $35$ cross-year | v3 §IV-F.1 / Script 28 / Appendix B byte-decomposition output | **inherited from v3; not regenerated in v4.0 spike scripts** |
| Big-4 K=3 per-firm C1 hard-assignment | $0\%$ / $8.9\%$ / $23.5\%$ / $11.5\%$ | Script 35 firm × cluster cross-tab | direct |
| **Composition decomposition (§III-I.4):** | | | |
| Within-firm signature-level dip $p_{\text{cos}}$ Big-4 (A/B/C/D) | $0.176 / 0.991 / 0.551 / 0.976$ | Script 39b per-firm | direct, $n_{\text{boot}} = 2000$ |
| Within-firm signature-level dip $p_{\text{cos}}$ non-Big-4 (10 firms, range) | $[0.59, 0.99]$ | Script 39c per-firm | direct, firms with $\geq 500$ signatures |
| Within-firm jittered-dHash dip $p$ Big-4 (5 seeds, median) A/B/C/D | $0.999 / 0.996 / 0.999 / 0.9995$ | Script 39d multi-seed | uniform jitter $[-0.5, +0.5]$ |
| Within-firm jittered-dHash dip $p$ non-Big-4 (5 seeds, range across 10 firms) | $[0.71, 1.00]$ | Script 39d / 39c | uniform jitter $[-0.5, +0.5]$ |
| Big-4 pooled dHash dip $p$ raw / jittered (seed median) | $< 5 \times 10^{-4}$ / $< 5 \times 10^{-4}$ | Script 39d | jitter alone does not eliminate Big-4 pooled rejection |
| Big-4 pooled dHash dip $p$ firm-centred + jittered (5-seed median) | $0.35$ | Script 39e 2×2 factorial | both corrections eliminate rejection ($0/5$ seeds at $\alpha = 0.05$) |
| Big-4 firm-centred signature-level cos dip $p$ | $0.597$ | codex round-30 verification on Script 43 substrate | independent verification |
| Big-4 firm-centred accountant-level cos\_mean dip $p$ | $1.0$ | codex round-30 verification | independent verification |
| Per-firm Big-4 dHash mean (A/B/C/D) | $2.73 / 6.46 / 7.39 / 7.21$ | Script 39e per-firm summary | direct |
| Big-4 integer-histogram valley near $\text{dHash} \approx 5$ within any firm | none in any of A/B/C/D | Script 39d valley analysis | bins $0$–$20$ |
| **Anchor-based calibration (§III-L.1):** | | | |
| Per-comparison ICCR cos $> 0.95$ Big-4 | $0.00060$ (Wilson 95% $[0.00053, 0.00067]$) | Script 40b | $5 \times 10^5$ inter-CPA pairs, Big-4 scope |
| Per-comparison ICCR cos $> 0.945$ Big-4 | $0.00081$ (Wilson 95% $[0.00073, 0.00089]$) | Script 40b | direct |
| Per-comparison ICCR cos $> 0.97$ / cos $> 0.98$ Big-4 | $0.00024$ / $0.00009$ | Script 40b | direct |
| Per-comparison ICCR dHash $\leq 5$ Big-4 | $0.00129$ (Wilson 95% $[0.00120, 0.00140]$) | Script 40b | direct, v4 new |
| Per-comparison ICCR dHash $\leq 4 / 3 / 2$ Big-4 | $0.00050 / 0.00019 / 0.00006$ | Script 40b | direct |
| Per-comparison joint ICCR cos $> 0.95$ AND dHash $\leq 5$ Big-4 | $0.00014$ | Script 40b | any-pair semantics |
| Per-comparison joint ICCR cos $> 0.95$ AND dHash $\leq 4$ Big-4 | $0.00011$ | Script 40b | any-pair semantics |
| Conditional ICCR dHash $\leq 5$ given cos $> 0.95$ Big-4 | $0.234$ (Wilson 95% $[0.190, 0.285]$) | Script 40b | $70 / 299$ pairs |
| All-firms per-comparison joint ICCR | $0.00007$ | Script 40b | corroborating scope |
| **Pool-normalised per-signature alert rate (§III-L.2):** | | | |
| Per-signature any-pair ICCR HC Big-4 | $0.1102$ (Wilson 95% $[0.1086, 0.1118]$; CPA-bootstrap 95% $[0.0908, 0.1330]$) | Script 43 | $n_{\text{sig}} = 150{,}453$ (vector-complete) |
| Per-signature same-pair ICCR HC Big-4 | $0.0827$ (Wilson 95% $[0.0813, 0.0841]$; CPA-bootstrap 95% $[0.0668, 0.1021]$) | Script 43 | stricter alternative |
| Per-firm any-pair ICCR HC (A/B/C/D) | $0.2594 / 0.0147 / 0.0053 / 0.0110$ | Script 43 per-firm | direct |
| Per-firm same-pair ICCR HC (A/B/C/D) | $0.2018 / 0.0023 / 0.0019 / 0.0051$ | Script 43 per-firm | direct |
| Pool-size decile 1 / decile 10 any-pair ICCR | $0.0249 / 0.1905$ | Script 43 decile table | broadly monotone with two minor reversals |
| Per-signature tighter ICCR cos $> 0.95$ AND dHash $\leq 3$ same-pair Big-4 | $0.0449$ | Script 43 | optional stricter operating point |
| **Document-level alert rate (§III-L.3):** | | | |
| Document-level ICCR D1 (HC only) Big-4 | $0.1797$ (Wilson 95% $[0.1770, 0.1825]$) | Script 45 | $n_{\text{docs}} = 75{,}233$ |
|
||||
| Document-level ICCR D2 (HC + MC) Big-4 | $0.3375$ (Wilson 95% $[0.3342, 0.3409]$) | Script 45 | operational alarm definition |
|
||||
| Document-level ICCR D3 (HC + MC + HSC) Big-4 | $0.3384$ (Wilson 95% $[0.3351, 0.3418]$) | Script 45 | descriptive |
|
||||
| Per-firm document-level D2 ICCR (A/B/C/D) | $0.6201 / 0.1600 / 0.1635 / 0.0863$ | Script 45 per-firm | direct |
|
||||
| **Firm-heterogeneity logistic regression (§III-L.4):** | | | |
|
||||
| Logistic OR (Firm B / C / D vs A) | $0.053 / 0.010 / 0.027$ | Script 44 regression | controlling for log pool size; reference $=$ Firm A |
|
||||
| Logistic OR log(pool size, centred) | $4.01$ | Script 44 regression | pool-size effect after firm adjustment |
|
||||
| Cross-firm hit matrix Firm A source $\to$ Firm A candidate (any-pair) | $14{,}447 / 14{,}622$ | Script 44 cross-firm matrix | $98.8\%$ within-firm |
|
||||
| Cross-firm hit matrix same-pair within-firm rate (A/B/C/D) | $99.96\% / 97.7\% / 98.2\% / 97.0\%$ | Script 44 same-pair section | direct |
|
||||
| **Threshold-sensitivity (§III-L.5):** | | | |
|
||||
| Local / median gradient ratio cos $= 0.95$ | $\approx 25\times$ | Script 46 plateau diagnostic | descriptive, not formal plateau test |
|
||||
| Local / median gradient ratio dHash $= 5$ | $\approx 3.8\times$ | Script 46 plateau diagnostic | descriptive |
|
||||
| Local / median gradient ratio dHash $= 15$ | $\approx 0.08$ | Script 46 plateau diagnostic | MC/HSC boundary plateau-like |
|
||||
| **Observed deployed alert rate (§III-L.6):** | | | |
|
||||
| Per-signature observed-deployed HC rate Big-4 | $0.4958$ | Script 46 / Script 42 | actual same-CPA pools |
|
||||
| Per-document observed-deployed HC rate Big-4 | $0.6228$ | Script 46 | actual same-CPA pools |
|
||||
| Deployed-rate excess over inter-CPA proxy (per-sig HC) | $0.3856$ pp | derived | $0.4958 - 0.1102$ |
|
||||
| Deployed-rate excess over inter-CPA proxy (per-doc HC) | $0.4431$ pp | derived | $0.6228 - 0.1797$ |
|
||||
| **Sample-size reconciliation:** | | | |
|
||||
| Big-4 signatures with pre-computed descriptors | $150{,}442$ | Script 39 / 39b / 39d / 39e / 45 / 46 | descriptor-complete subset |
|
||||
| Big-4 signatures with feature + dHash vectors stored | $150{,}453$ | Script 40b / 43 / 44 | vector-complete subset |
|
||||
| Difference between the two counts | $11$ signatures | direct (descriptor-completion lag) | negligible at population scale |
|
||||
| Big-4 CPAs all (any signature count) | $468$ | Script 40b / 43 / 44 | direct |
|
||||
| Big-4 CPAs with $n_{\text{sig}} \geq 10$ for accountant-level stability | $437$ | Scripts 36 / 38 / 39 | accountant-level analysis threshold |
|
||||
|
||||
---
## Cross-reference index (author working checklist; remove before submission)

- **Big-4 sub-corpus definition** (§III-G) — 437 CPAs / $n_{\text{sig}} \geq 10$ at accountant-level, 468 CPAs / 150,442–150,453 signatures at signature-level (sample-size reconciliation in §III-G).
- **Reference populations** (§III-H) — Firm A as templated-end case study; non-Big-4 ($n = 249$) as reverse-anchor reference (less-replicated population).
- **Distributional diagnostics + composition decomposition** (§III-I) — Big-4 accountant-level dip-test rejection ($p < 5 \times 10^{-4}$); §III-I.4's 2×2 factorial decomposition (firm centring × integer jitter) shows the rejection is fully explained by between-firm location shift + integer mass-point artefacts; **no within-population bimodality and no natural threshold**.
- **K=3 as descriptive firm-compositional partition** (§III-J) — C1/C2/C3 are descriptive positions on the descriptor plane reflecting Firm A vs others composition; not mechanism clusters; not used as operational classifier.
- **Convergent internal-consistency** (§III-K) — three feature-derived scores ($\rho \geq 0.879$, not independent measurements); per-signature K=3 ($\kappa = 0.87$ vs per-CPA fit); K=2 LOOO unstable, K=3 LOOO partial; pixel-identity miss rate $0\%$ on $n = 262$.
- **Anchor-based threshold calibration + operational classifier** (§III-L) — inherited five-way rule retained; characterised by inter-CPA negative-anchor coincidence rates at per-comparison (§III-L.1: cos $> 0.95$ at $0.0006$, dHash $\leq 5$ at $0.0013$, joint at $0.00014$), per-signature pool (§III-L.2: $0.11$ any-pair HC), per-document (§III-L.3: HC $0.18$; HC+MC $0.34$); firm heterogeneity (§III-L.4) decisive after pool-size adjustment; within-firm cross-CPA collision concentration $\geq 97\%$; threshold-sensitivity analysis (§III-L.5) confirms HC threshold is locally sensitive, not plateau-stable; deployed-rate excess over proxy (§III-L.6) $\approx 38$ pp per-signature and $\approx 44$ pp per-document.
- **Validation strategy and limitations** (§III-M) — multi-tool diagnostic collection (9 tools, each with disclosed untested assumption); positioning as anchor-calibrated screening framework with human-in-the-loop review, not as validated forensic detector; no FRR / sensitivity / EER / ROC-AUC reportable.

## Open questions remaining for partner / reviewer

---

# Paper A v4.0 Phase 4 Prose Draft v3 (post codex rounds 26–34)
> **Draft note (2026-05-13, Phase 4 v3; internal — remove before submission).** This file replaces the v3.20.0 Abstract, §I Introduction, §II Related Work, §V Discussion, and §VI Conclusion blocks with the v4.0 prose. The methodology and results sections (§III v7 and §IV v3.2 on this branch) are the technical foundation; Phase 4 prose aligns the narrative with the post-codex-round-34 framing. v3 (2026-05-13) reflects the major restructuring driven by codex rounds 29–34: distributional path to thresholds demolished (Scripts 39b–39e); anchor-based multi-level inter-CPA coincidence-rate calibration adopted (Scripts 40b, 43, 44, 45, 46); K=3 demoted to descriptive firm-compositional partition; "FAR" terminology replaced by "inter-CPA coincidence rate (ICCR)" throughout; nine-tool unsupervised validation strategy disclosed; positioning as anchor-calibrated screening framework with human-in-the-loop review (not validated forensic detector). Empirical anchors cite Scripts 32–46 on branch `paper-a-v4-big4`. Prior Phase 4 v2 changelog has been moved to `paper/v4/CHANGELOG.md`.
---
# Abstract
> *IEEE Access target: <= 250 words, single paragraph.*
Regulations require Certified Public Accountants (CPAs) to attest each audit report with a signature, but digitization makes reusing a stored signature image across reports — through administrative stamping or firm-level electronic signing — technically trivial and visually invisible, undermining individualized attestation. We build an end-to-end pipeline detecting such *non-hand-signed* signatures at scale: a Vision-Language Model identifies signature pages, YOLOv11 localizes signatures, ResNet-50 supplies deep features, and a dual-descriptor layer combines cosine similarity with an independent-minimum perceptual hash (dHash) to separate *style consistency* from *image reproduction*. Applied to 90,282 Taiwan audit reports (2013–2023), the pipeline yields 182,328 signatures from 758 CPAs; primary analyses are scoped to the Big-4 sub-corpus (437 CPAs; 150,442 signatures). Distributional diagnostics show that the apparent multimodality of the descriptor distribution dissolves under joint firm-mean centring and integer-tie jitter ($p$ rises to $0.35$), so no within-population bimodal antimode anchors the operational thresholds. We instead adopt an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units: per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ under the deployed any-pair high-confidence rule), and per-document ($0.34$ for the operational HC+MC alarm). Firm heterogeneity is decisive: Firm A's per-document HC+MC alarm rate is $0.62$ versus $0.09$–$0.16$ at Firms B/C/D after pool-size adjustment, with $98$–$100\%$ of inter-CPA collisions concentrated within the source firm — consistent with firm-level template-like reuse. We position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review, not as a validated forensic detector; no calibrated error rates are reportable without signature-level ground truth.
---
# I. Introduction
> *Target: ~1.5 pages double-column IEEE format. Double-blind: no author/institution info.*
Financial audit reports serve as a critical mechanism for ensuring corporate accountability and investor protection. In Taiwan, the Certified Public Accountant Act (會計師法 §4) and the Financial Supervisory Commission's attestation regulations (查核簽證核准準則 §6) require certifying CPAs to affix their signature or seal (簽名或蓋章) to each audit report [1]. While the law permits either a handwritten signature or a seal, the CPA's attestation on each report is intended to represent a deliberate, individual act of professional endorsement for that specific audit engagement [2].
The digitization of financial reporting has introduced a practice that complicates this intent. As audit reports are now routinely generated, transmitted, and archived as PDF documents, it is technically and operationally straightforward to reproduce a CPA's stored signature image across many reports rather than re-executing the signing act for each one. This reproduction can occur either through an administrative stamping workflow — in which scanned signature images are affixed by staff as part of the report-assembly process — or through a firm-level electronic signing system that automates the same step. We refer to signatures produced by either workflow collectively as *non-hand-signed*. Although this practice may fall within the literal statutory requirement of "signature or seal," it raises substantive concerns about audit quality, as an identically reproduced signature applied across hundreds of reports may not represent meaningful individual attestation for each engagement. The accounting literature has examined the audit-quality consequences of partner-level engagement transparency: studies of partner-signature mandates in the United Kingdom find measurable downstream effects [31], cross-jurisdictional evidence on individual partner signature requirements highlights similar quality channels [32], and Taiwan-specific evidence on mandatory partner rotation documents how individual-partner identification interacts with audit-quality outcomes [33]. Unlike traditional signature forgery, where a third party attempts to imitate another person's handwriting, non-hand-signing involves the legitimate signer's own stored signature being reused, and is visually invisible to report users at scale.
The distinction between *non-hand-signing detection* and *signature forgery detection* is conceptually and technically important. The extensive body of research on offline signature verification [3]–[8] focuses almost exclusively on forgery detection — determining whether a questioned signature was produced by its purported author. In our context, identity is not in question; the CPA is indeed the legitimate signer. The question is whether the physical act of signing occurred for each individual report, or whether a single signing event was reproduced as an image across many reports. This detection problem differs fundamentally from forgery detection: while it does not require modeling skilled-forger variability, it introduces the distinct challenge of separating legitimate intra-signer consistency from image-level reproduction.
A methodological concern shapes the research design. Many prior similarity-based classification studies rely on ad-hoc thresholds — declaring two images equivalent above a hand-picked cosine cutoff, for example — without principled statistical justification. Such thresholds are fragile in an archival-data setting. A defensible approach requires (i) explicit calibration of the operational thresholds against measurable negative-anchor evidence; (ii) diagnostic procedures that test whether the descriptor distribution itself supports a within-population threshold, including formal decomposition of apparent multimodality into between-group composition and integer-tie artefacts; (iii) annotation-free reporting of operational alarm rates at multiple analysis units (per-comparison, per-signature pool, per-document) with Wilson 95% confidence intervals; (iv) per-firm stratification of the reported rates to surface heterogeneity that aggregate metrics conceal; and (v) explicit disclosure of the unsupervised setting's limits — in particular, the inability to estimate true error rates without signature-level ground-truth labels.
Despite the significance of the problem for audit quality and regulatory oversight, no prior work has specifically addressed non-hand-signing detection in financial audit documents at scale with these methodological safeguards. Woodruff et al. [9] developed an automated pipeline for signature analysis in corporate filings for anti-money-laundering investigations, but their work focused on author clustering rather than detecting image reuse. Copy-move forgery detection methods [10], [11] address duplicated regions within or across images but are designed for natural images and do not account for the specific characteristics of scanned document signatures. Research on near-duplicate image detection using perceptual hashing combined with deep learning [12], [13] provides relevant methodological foundations but has not been applied to document forensics or signature analysis. From the statistical side, the methods we adopt for distributional characterisation — the Hartigan dip test [37] and finite mixture modelling via the EM algorithm [40], [41], complemented by a Burgstahler-Dichev / McCrary density-smoothness diagnostic [38], [39] — have been developed in statistics and accounting-econometrics but have not been combined as a joint diagnostic toolkit for document-forensics threshold characterisation.
In this paper we present a fully automated, end-to-end pipeline for detecting non-hand-signed CPA signatures in audit reports at scale, together with a multi-tool validation framework that explicitly discloses the unsupervised setting's limits. The pipeline processes raw PDF documents through (1) signature page identification with a Vision-Language Model; (2) signature region detection with a trained YOLOv11 object detector; (3) deep feature extraction via a pre-trained ResNet-50; (4) dual-descriptor similarity (cosine + independent-minimum dHash); (5) anchor-based threshold calibration at three units of analysis (per-comparison, pool-normalised per-signature, per-document) against an inter-CPA negative-anchor coincidence-rate proxy (§III-L); (6) firm-stratified per-rule reporting and a within-firm cross-CPA hit-matrix analysis (§III-L.4); (7) a composition decomposition that establishes the absence of a within-population bimodal antimode in the descriptor distributions (§III-I.4); and (8) a multi-tool unsupervised validation strategy with disclosed assumption-violation analysis (§III-M).
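
For concreteness, the dual-descriptor layer of steps (3)–(4) can be reduced to a short numpy sketch. This is illustrative only (the function names and the block-mean downsampling are ours for exposition; the pipeline's production implementation is not reproduced here):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two deep-feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dhash_bits(gray: np.ndarray, hash_size: int = 8) -> np.ndarray:
    """Difference hash: reduce a grayscale image to (hash_size, hash_size + 1)
    block means, then compare horizontally adjacent cells -> 64 bits."""
    rows = np.array_split(np.arange(gray.shape[0]), hash_size)
    cols = np.array_split(np.arange(gray.shape[1]), hash_size + 1)
    small = np.array([[gray[np.ix_(r, c)].mean() for c in cols] for r in rows])
    return (small[:, 1:] > small[:, :-1]).ravel()

def dhash_distance(g1: np.ndarray, g2: np.ndarray) -> int:
    """Hamming distance between two dHashes; 0 means identical reproduction."""
    return int(np.count_nonzero(dhash_bits(g1) != dhash_bits(g2)))
```

The point of the pairing: high cosine with low dHash distance signals image reproduction, while high cosine with moderate dHash distance is consistent with mere style consistency.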
The methodological reframing relative to earlier versions of this work is central to our v4.0 contribution. Earlier work in this lineage adopted a distributional path to thresholds — fitting accountant-level finite-mixture models and treating their marginal crossings as data-derived "natural" thresholds. v4.0 reports a composition decomposition diagnostic (§III-I.4) that overturns this reading: the apparent multimodality of the Big-4 accountant-level distribution is fully explained by between-firm location-shift effects (Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$) and integer mass-point artefacts on the integer-valued dHash axis. Once both confounds are removed (firm-mean centring plus uniform integer jitter), the Big-4 pooled dHash dip test yields $p_{\text{median}} = 0.35$ across five jitter seeds, eliminating the rejection. Within-firm signature-level cosine and jittered-dHash dip tests fail to reject in every individual Big-4 firm and in every individual mid/small firm with $\geq 500$ signatures (10 firms tested in Script 39c). The descriptor distributions therefore contain no within-population bimodal antimode that could anchor an operational threshold.
In place of distributional anchoring, v4.0 adopts an anchor-based inter-CPA coincidence-rate (ICCR) calibration. At the per-comparison unit, the inherited cos$>0.95$ operating point yields ICCR $= 0.00060$ on a $5 \times 10^5$-pair Big-4 sample (replicating v3.x's reported per-comparison rate of $0.0005$ under prior "FAR" terminology); the dHash$\leq 5$ structural cutoff yields ICCR $= 0.00129$ (v4 new); the joint rule cos$>0.95$ AND dHash$\leq 5$ yields joint ICCR $= 0.00014$ (any-pair semantics, matching the deployed extrema rule). At the pool-normalised per-signature unit, the same rule's effective coincidence rate is materially higher because the deployed classifier takes max-cosine and min-dHash over a same-CPA pool: pooled Big-4 any-pair ICCR is $0.1102$ (Wilson 95% CI $[0.1086, 0.1118]$; CPA-block bootstrap 95% $[0.0908, 0.1330]$). At the per-document unit, the operational HC$+$MC alarm fires on $33.75\%$ of Big-4 documents under the inter-CPA candidate-pool counterfactual.
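
The Wilson intervals attached to these ICCR point estimates follow the standard score-interval formula; a minimal sketch (the hit count of 300 in $5 \times 10^5$ pairs is back-derived from the reported $0.00060$ rate for illustration, not taken from Script 40b output):

```python
import math

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n (95% at z = 1.96)."""
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# Illustrative: ~300 coincidences among 5e5 inter-CPA pairs -> ICCR ~0.0006
lo, hi = wilson_interval(300, 500_000)
```

The Wilson form is preferred over the normal-approximation interval here because the coincidence rates are very small relative to the pair counts.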
The pooled per-signature and per-document rates conceal striking firm heterogeneity. A logistic regression of the per-signature hit indicator on firm dummies (Firm A reference) and centred log pool size yields odds ratios of $0.053$ (Firm B), $0.010$ (Firm C), and $0.027$ (Firm D) — Firms B/C/D are an order of magnitude below Firm A even after controlling for the pool-size confound (Script 44). Cross-firm hit matrix analysis shows that $98$–$100\%$ of inter-CPA collisions originate from candidates within the source firm (different CPA, same firm), consistent with firm-specific template, stamp, or document-production reuse mechanisms — though not by itself diagnostic of deliberate sharing. We retain the inherited Paper A v3.x five-way box rule as the operational classifier; v4.0's contribution is to characterise its multi-level coincidence behaviour against the inter-CPA negative anchor rather than to derive new thresholds.
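
The within-firm collision concentration can be computed directly from per-hit source and candidate firm labels; a minimal sketch with toy data (array and function names are ours, not Script 44's):

```python
import numpy as np

def within_firm_share(source_firm: np.ndarray, candidate_firm: np.ndarray) -> dict:
    """For each source firm, fraction of inter-CPA hits whose candidate
    belongs to the same firm (different CPA, same firm)."""
    shares = {}
    for firm in np.unique(source_firm):
        mask = source_firm == firm
        shares[str(firm)] = float(np.mean(candidate_firm[mask] == firm))
    return shares

# Toy example: 3 of 4 Firm-A-sourced hits have a Firm A candidate
src = np.array(["A", "A", "A", "A", "B"])
cand = np.array(["A", "A", "A", "B", "B"])
shares = within_firm_share(src, cand)
```

In the paper's data this statistic is what yields the $98$–$100\%$ within-firm concentration reported above.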
Three feature-derived scores converge on the per-CPA descriptor-position ranking with Spearman $\rho \geq 0.879$ (Script 38): the K=3 mixture posterior (now interpreted as a firm-compositional position score, not a mechanism cluster posterior; §III-J), a reverse-anchor cosine percentile relative to a strictly-out-of-target non-Big-4 reference, and the inherited box-rule less-replication-dominated rate. The three scores are deterministic functions of the same per-CPA descriptor pair, so the convergence is documented as internal consistency among feature-derived ranks rather than external validation. Hard ground truth for the *replicated* class is provided by 262 byte-identical signatures in the Big-4 subset (Firm A 145, Firm B 8, Firm C 107, Firm D 2), against which all three candidate checks achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). For the box rule this result is close to tautological at byte-identity; we discuss the conservative-subset caveat in §V-G.
We apply this pipeline to 90,282 audit reports filed by publicly listed companies in Taiwan between 2013 and 2023, extracting and analyzing 182,328 individual CPA signatures from 758 unique accountants. The Big-4 sub-corpus comprises 437 CPAs and 150,442 signatures with both descriptors available.
The contributions of this paper are:
1. **Problem formulation.** We define non-hand-signing detection as distinct from signature forgery detection and frame it as a detection problem on intra-signer similarity distributions.
2. **End-to-end pipeline.** We present a pipeline that processes raw PDF audit reports through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor similarity computation, with automated inference and no manual intervention after initial training.
3. **Dual-descriptor verification.** We demonstrate that combining deep-feature cosine similarity with independent-minimum dHash resolves the ambiguity between *style consistency* and *image reproduction*, and we validate the backbone choice through a feature-backbone ablation.
4. **Composition decomposition disproves the distributional-threshold path.** We show via a 2×2 factorial diagnostic (firm-mean centring × integer-tie jitter) that the apparent multimodality of the Big-4 accountant-level descriptor distribution is fully attributable to between-firm location shifts and integer mass-point artefacts. The descriptor distributions contain no within-population bimodal antimode; "natural threshold" language in this lineage's prior work is not empirically supported.
5. **Anchor-based multi-level inter-CPA coincidence-rate calibration.** We characterise the deployed five-way classifier at three units of analysis: per-comparison ICCR (cos$>0.95$: $0.0006$; dHash$\leq 5$: $0.0013$; joint: $0.00014$), pool-normalised per-signature ICCR ($0.11$ for the deployed any-pair high-confidence rule), and per-document ICCR ($0.34$ for the operational HC$+$MC alarm). We adopt "inter-CPA coincidence rate" as the metric name throughout and reserve "False Acceptance Rate" for terminology that requires ground-truth negative labels, which the corpus does not provide.
6. **Firm heterogeneity quantification and within-firm cross-CPA collision concentration.** Per-firm rates differ by an order of magnitude after pool-size adjustment (Firm A's per-document HC$+$MC alarm at $0.62$ versus Firms B/C/D at $0.09$–$0.16$). Cross-firm hit matrix analysis shows that $98$–$100\%$ of inter-CPA collisions originate from candidates within the source firm, consistent with firm-specific template, stamp, or document-production reuse mechanisms — a descriptive finding about deployed-rule behaviour, not a claim of deliberate template sharing.
7. **K=3 as descriptive firm-compositional partition; three-score convergent internal consistency.** We fit a K=3 Gaussian mixture as a descriptive partition of the Big-4 accountant-level distribution (no longer interpreted as three mechanism clusters). Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$; we report this as internal consistency rather than external validation, given that the scores share the underlying descriptor pair.
8. **Annotation-free positive-anchor validation and unsupervised validation ceiling.** We achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$) on 262 byte-identical Big-4 signatures, with the conservative-subset caveat that byte-identical pairs are by construction near cos$=1$ and dHash$=0$. We frame the overall validation strategy as a multi-tool collection of nine partial-evidence diagnostics, each with an explicitly disclosed untested assumption; their conjunction constitutes the unsupervised validation ceiling achievable on this corpus. We do not claim a validated forensic detector; we position the system as a specificity-proxy-anchored screening framework with human-in-the-loop review.
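
As a concrete illustration of the any-pair semantics underlying contribution 5, the deployed high-confidence pool rule reduces to an existence check over the candidate pool; a minimal sketch (array and function names are ours for illustration; the thresholds are the inherited operating points):

```python
import numpy as np

def any_pair_hc(cos_to_pool: np.ndarray, dhash_to_pool: np.ndarray,
                cos_thr: float = 0.95, dhash_thr: int = 5) -> bool:
    """Any-pair high-confidence rule: a signature alerts when ANY candidate in
    its pool satisfies cos > 0.95 AND dHash <= 5.  Because the rule takes
    extrema over the pool, larger pools mechanically raise the coincidence
    rate, which is why the per-signature ICCR is pool-normalised."""
    return bool(np.any((cos_to_pool > cos_thr) & (dhash_to_pool <= dhash_thr)))
```

Note that the conjunction must hold for a single candidate: one candidate at cos $0.93$ / dHash $7$ plus another at cos $0.96$ / dHash $4$ triggers the rule, while one candidate at cos $0.96$ / dHash $7$ does not.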
The remainder of the paper is organized as follows. Section II reviews related work on signature verification, document forensics, perceptual hashing, and the statistical methods used. Section III describes the proposed methodology. Section IV presents the experimental results — distributional characterisation, mixture fits, convergent internal-consistency checks, leave-one-firm-out reproducibility, pixel-identity validation, and full-dataset robustness. Section V discusses the implications and limitations. Section VI concludes with directions for future work.
---
# II. Related Work
> *Note for the Phase 4 review pass: §II is inherited substantively unchanged from v3.20.0 §II in the master manuscript, with one new paragraph added below. The unchanged content is not reproduced in this Phase 4 file; readers reviewing this draft should consult `paper/paper_a_related_work_v3.md` for the v3.20.0 §II text covering offline signature verification, near-duplicate detection, copy-move forgery detection, perceptual hashing, deep-feature similarity, and the statistical methods adopted (Hartigan dip test, finite mixture EM, Burgstahler-Dichev / McCrary density-smoothness diagnostic). The paragraph below is the only v4.0-specific §II addition.*
**Addition for v4.0: leave-one-firm-out cross-validation in a small-cluster scope.** Cross-validation methodology in the leave-one-out tradition has been developed extensively in statistics since Stone [42] and Geisser [43], and modern surveys including Vehtari et al. [44] discuss its application to mixture models. In document-forensics calibration the technique has been used selectively, typically with the individual document or signature as the hold-out unit. Our application in §III-K differs in two respects from the standard usage: (i) the hold-out unit is the *firm* (not the individual CPA or signature), so the analysis directly probes cross-firm reproducibility of the fitted mixture rather than within-firm sampling variance; and (ii) the held-out predictions are interpreted as a *composition-sensitivity band* on the candidate mixture boundary, not as a sufficiency claim for the inherited five-way operational classifier (which is calibrated separately; §III-L). We treat LOOO drift as descriptive information about how the mixture characterisation moves when training composition changes, not as a pass/fail test for the operational classifier. Numerical references [42]–[44] are placeholders in this draft and will be replaced with the project's preferred references at copy-edit time.
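
The firm-level hold-out scheme can be sketched in a few lines; `fit_fn` stands in for the project's mixture-fitting routine (in practice an EM Gaussian-mixture fit), and the toy `_NearestMean` model below is ours for illustration only:

```python
import numpy as np

def leave_one_firm_out(descriptors: np.ndarray, firms: np.ndarray, fit_fn) -> dict:
    """Hold out one *firm* per fold (not one CPA or signature): fit on the
    remaining firms, predict the held-out firm.  The folds are read as a
    composition-sensitivity band, not as a pass/fail test."""
    folds = {}
    for firm in np.unique(firms):
        held_out = firms == firm
        model = fit_fn(descriptors[~held_out])
        folds[str(firm)] = model.predict(descriptors[held_out])
    return folds

class _NearestMean:
    """Toy stand-in for the mixture fit, for demonstration only."""
    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        return self
    def predict(self, X):
        return (np.linalg.norm(X - self.mean_, axis=1) < 1.0).astype(int)

X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10])
firms = np.array(["A"] * 5 + ["B"] * 5)
folds = leave_one_firm_out(X, firms, lambda X: _NearestMean().fit(X))
```

The design choice is the hold-out unit: by varying training *composition* at the firm level, across-fold drift in the fitted boundary directly measures how firm mix, rather than sampling noise, moves the characterisation.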
---
# V. Discussion
## A. Non-Hand-Signing Detection as a Distinct Problem
Non-hand-signing differs from forgery in that the questioned signature is produced by its legitimate signer's own stored image rather than by an impostor. The detection problem is therefore framed around *intra-signer image reproduction* rather than *inter-signer imitation*. This framing has analytical consequences. The within-CPA signature distribution is the analytical population of interest; the cross-CPA inter-class distribution is a *reference* against which intra-CPA similarity is interpreted, not the population to be modelled. This contrasts with most prior offline signature verification work, which treats genuine-versus-forged as the central two-class problem.
## B. Per-Signature Similarity is a Continuous Quality Spectrum; the Accountant-Level Multimodality is Composition-Driven
A central empirical finding of v3.x was that *per-signature* similarity does not admit a clean two-mechanism mixture: the dip test fails to reject unimodality at the signature level for Firm A, BIC prefers a 3-component fit, and BD/McCrary candidate transitions lie inside the high-similarity mode rather than between modes. v4.0 strengthens and extends this signature-level reading.
The Big-4 accountant-level descriptor distribution does reject unimodality on both marginals at $p < 5 \times 10^{-4}$ (Script 34). v4.0's composition decomposition (§III-I.4; Scripts 39b–39e) shows that this rejection is fully attributable to two non-mechanistic sources: (a) between-firm location-shift effects on both axes — Firm A's mean dHash of $2.73$ versus Firms B/C/D's $6.46$, $7.39$, $7.21$ creates a multi-peaked pooled distribution that any single firm's distribution lacks — and (b) integer mass-point artefacts on the integer-valued dHash axis, which inflate the dip statistic against a continuous-density null. A 2×2 factorial diagnostic applied to the Big-4 pooled dHash (firm-mean centring × uniform integer jitter $[-0.5, +0.5]$, 5 jitter seeds) shows that the dip test fails to reject ($p_{\text{median}} = 0.35$, 0/5 seeds reject) when *both* corrections are applied; either correction alone leaves the rejection in place. Within-firm signature-level cosine and jittered-dHash dip tests fail to reject in every individual Big-4 firm and in every individual non-Big-4 firm with $\geq 500$ signatures (10 firms tested). The descriptor distributions therefore lack a within-population bimodal antimode that could anchor an operational threshold. The K=2 / K=3 mixture fits are retained in §III-J as descriptive partitions of the joint Big-4 distribution that reflect firm-compositional structure, not as inferential evidence for two or three latent mechanism modes.
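
The two corrections of the 2×2 factorial can be sketched as follows; the dip test itself (e.g. via a Hartigan dip-test implementation such as the `diptest` package, named here as an assumption rather than as the scripts' actual dependency) would then be applied to each of the four cells across several jitter seeds:

```python
import numpy as np

def centre_by_firm(x: np.ndarray, firms: np.ndarray) -> np.ndarray:
    """Remove between-firm location shifts by subtracting each firm's mean."""
    out = x.astype(float).copy()
    for firm in np.unique(firms):
        mask = firms == firm
        out[mask] -= out[mask].mean()
    return out

def jitter_integers(x: np.ndarray, seed: int) -> np.ndarray:
    """Break integer mass points on the dHash axis with uniform [-0.5, 0.5) noise."""
    rng = np.random.default_rng(seed)
    return x.astype(float) + rng.uniform(-0.5, 0.5, size=x.shape)

# The four factorial cells: raw, centred-only, jittered-only, centred + jittered.
# Only the centred + jittered cell removes both confounds at once.
```

The factorial structure is what licenses the attribution: if either correction alone left the rejection in place while both together removed it, both confounds were necessary parts of the apparent multimodality.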
## C. Firm A as the Templated End of Big-4 (Case Study, Not Calibration Anchor)
Firm A is empirically the firm whose CPAs are most concentrated in the high-cosine, low-dHash corner of the Big-4 descriptor plane. In the Big-4 K=3 hard-posterior assignment (now interpreted as a firm-compositional position assignment; §III-J), Firm A accounts for $0\%$ of C1 (low-cos / high-dHash position) and $82.5\%$ of C3 (high-cos / low-dHash position); the opposite pattern holds at Firm C, which has the highest C1 concentration at $23.5\%$. Firm A also accounts for 145 of the 262 byte-identical signatures in the Big-4 byte-identical anchor of §IV-H (with Firm B 8, Firm C 107, Firm D 2). The additional v3.x finding that the 145 Firm A pixel-identical signatures span 50 distinct Firm A partners (of 180 registered), with 35 byte-identical matches across different fiscal years, is inherited from v3.20.0 §IV-F.1 / Script 28 / Appendix B byte-decomposition output and was not regenerated in v4.0's spike scripts; we retain those numbers by reference.
In v4.0 we treat Firm A as a *templated-end case study* rather than as the calibration anchor for the operational threshold. Firm A enters the Big-4 anchor-based ICCR calibration on equal footing with the other three Big-4 firms (§III-L). The cross-firm hit matrix of §III-L.4 strengthens this framing: $98$–$100\%$ of inter-CPA collisions originate from candidates within the source firm, regardless of which Big-4 firm is the source. Firm A's high per-document HC$+$MC alarm rate of $0.62$ (versus Firms B/C/D's $0.09$–$0.16$) reflects high inter-CPA collision concentration under the deployed rule on real same-CPA pools, consistent with firm-specific template, stamp, or document-production reuse — though the inter-CPA-anchor analysis alone is not diagnostic of deliberate template sharing. The byte-level evidence of v3.x §IV-F.1 (Firm A's 145 pixel-identical signatures across $\sim 50$ distinct partners) provides direct evidence that firm-level template reuse does occur at Firm A; the within-firm collision pattern at all four Big-4 firms is consistent with that mechanism extending in milder form to Firms B/C/D.
## D. K=2 / K=3 as Descriptive Firm-Compositional Partitions
Leave-one-firm-out cross-validation of the Big-4 mixture fit reveals a sharp contrast between K=2 and K=3 behaviour. K=2 is unstable: across-fold cosine-crossing deviation is $0.028$, and holding Firm A out gives a fold rule (cos $> 0.938$, dHash $\leq 8.79$) that classifies $100\%$ of held-out Firm A in the upper component, while holding any non-Firm-A Big-4 firm out gives a fold rule near (cos $> 0.975$, dHash $\leq 3.76$) that classifies $0\%$ of the held-out firm in the upper component. The K=2 boundary is essentially a Firm-A-vs-others separator — direct evidence that the K=2 partition reflects firm-compositional rather than mechanistic structure.
K=3 in contrast has a *reproducible component shape* at the descriptor-position level: across the four folds the C1 (low-cos / high-dHash) component cosine mean varies by at most $0.005$, the dHash mean by at most $0.96$, and the weight by at most $0.023$. Hard-posterior membership for the held-out firm is composition-sensitive (absolute differences $1.8$–$12.8$ pp across folds). Together with the §III-I.4 composition decomposition (no within-population bimodal antimode), the K=3 stability supports a descriptive reading: the Big-4 descriptor plane has a reproducible three-region partition that reflects how firm-compositional weight is distributed across the descriptor space, *not* a three-mechanism latent-class structure. We accordingly do not use K=3 hard-posterior membership as an operational classifier; we use it as the accountant-level descriptive summary that complements the deployed signature-level five-way classifier of §III-L.
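The leave-one-firm-out refit can be sketched with `scikit-learn`'s `GaussianMixture`; the per-firm descriptor locations below are invented for illustration and are not the paper's fitted components:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical per-CPA (cosine, dHash) descriptor means for four firms;
# Firm A sits in the high-cos / low-dHash corner, the others centrally.
firms = np.repeat(["A", "B", "C", "D"], 100)
X = np.vstack([
    rng.normal([0.98, 2.5], [0.01, 1.0], (100, 2)),
    rng.normal([0.95, 7.0], [0.01, 1.5], (100, 2)),
    rng.normal([0.95, 7.5], [0.01, 1.5], (100, 2)),
    rng.normal([0.96, 7.0], [0.01, 1.5], (100, 2)),
])

# Leave-one-firm-out: refit K=3 on the remaining firms, then assign the
# held-out firm by hard posterior (argmax responsibility), as in §III-J.
held_out_share = {}
for f in ["A", "B", "C", "D"]:
    gmm = GaussianMixture(n_components=3, random_state=0).fit(X[firms != f])
    labels = gmm.predict(X[firms == f])
    held_out_share[f] = np.bincount(labels, minlength=3) / labels.size
```

Comparing `held_out_share` across folds is the composition-sensitivity check; comparing the fold-wise component means is the shape-reproducibility check.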
## E. Three-Score Convergent Internal Consistency
Three feature-derived scores agree on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$: the K=3 mixture posterior (a firm-compositional position score, not a mechanism cluster posterior); the reverse-anchor cosine percentile under a non-Big-4 reference distribution; and the inherited Paper A box-rule less-replication-dominated rate. The three scores are *not* statistically independent measurements — they are deterministic functions of the same per-CPA descriptor pair — so the convergence is documented as internal consistency rather than external validation against an independent ground truth (which the corpus does not provide for the hand-signed class). The strength of the convergence (all pairwise $|\rho| > 0.87$) and its persistence at the signature level (Cohen $\kappa = 0.87$ between per-CPA-fit and per-signature-fit K=3 binary labels) are nevertheless informative: per-CPA aggregation does not collapse the broad three-region ordering, and three different summarisations of the descriptor space produce broadly concordant per-CPA rankings. One residual non-Firm-A disagreement remains: the reverse-anchor cosine percentile ranks Firm D fractionally above Firm C, while the mixture posterior and the box-rule rate rank Firm C highest among the non-Firm-A firms.
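The internal-consistency check reduces to rank correlation between score columns; a sketch with `scipy.stats.spearmanr` on synthetic stand-in scores (the proxy formulas below are invented for illustration, not the paper's score definitions):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Hypothetical per-CPA descriptor means; both proxies below are
# deterministic functions of the same (cosine, dHash) pair, so any
# agreement is internal consistency, not independent measurement.
cos = rng.uniform(0.90, 1.00, 300)
dhash = rng.uniform(0.0, 12.0, 300)

posterior_proxy = cos - 0.003 * dhash                   # stand-in "mixture posterior"
percentile_proxy = np.argsort(np.argsort(cos)) / 299.0  # stand-in "reverse-anchor percentile"
rho, _ = spearmanr(posterior_proxy, percentile_proxy)
```

Because both proxies are driven by the same `cos` column, a high `rho` here says nothing about ground truth — exactly the caveat the paragraph above documents.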
## F. Anchor-Based Multi-Level Calibration
The operational specificity of the deployed five-way classifier is characterised at three units of analysis (§III-L), all against the same inter-CPA negative-anchor coincidence-rate proxy. The per-comparison ICCR replicates v3.x's per-comparison rate (cos$>0.95 \to 0.00060$) and extends it to the structural dimension (dHash$\leq 5 \to 0.00129$; joint $\to 0.00014$). The pool-normalised per-signature ICCR captures the deployed rule's effective per-signature rate under inter-CPA candidate-pool replacement ($0.1102$ pooled Big-4 any-pair HC), exposing that the per-comparison rate is not the deployed-rule rate at the per-signature classifier level: the deployed classifier takes max-cosine and min-dHash over a same-CPA pool of size $n_{\text{pool}}$, so the inter-CPA-equivalent rate scales approximately as $1 - (1 - p_{\text{pair}})^{n_{\text{pool}}}$ in the independence limit. The per-document ICCR aggregates to operational alarm-rate units: HC alone $0.18$, the operational HC$+$MC alarm $0.34$.
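The independence-limit pool scaling can be made concrete in two lines; the pool size below is illustrative, not a corpus statistic:

```python
def pool_iccr(p_pair: float, n_pool: int) -> float:
    """Independence-limit per-signature coincidence rate for an any-pair
    rule: a signature alarms if any of its n_pool comparisons collides,
    so the rate scales as 1 - (1 - p_pair)^n_pool."""
    return 1.0 - (1.0 - p_pair) ** n_pool

# Joint per-comparison rate from the Big-4 spike (0.00014); a pool of a
# few hundred comparisons already lifts it into the 0.1 regime.
rate = pool_iccr(0.00014, 800)  # n_pool = 800 is a hypothetical pool size
```

This is why the per-comparison ICCR ($0.00014$) and the per-signature ICCR ($0.1102$) differ by three orders of magnitude without any contradiction.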
Two additional findings refine the calibration story. First, the per-pair conditional ICCR for dHash$\leq 5$ given cos$>0.95$ is $0.234$ (Wilson 95% $[0.190, 0.285]$): given the cosine gate, the structural dimension provides further per-comparison specificity at $\sim 4.3\times$ refinement. Second, the alert-rate sensitivity analysis (§III-L.5; Script 46) shows the inherited HC threshold is locally sensitive rather than plateau-stable (local gradient $\approx 25\times$ the median for cosine, $\approx 3.8\times$ for dHash); stakeholders requiring different specificity-alert-yield operating points can derive thresholds by inverting the ICCR curves (a tighter rule cos$>0.95$ AND dHash$\leq 3$ on the same-pair joint gives per-signature ICCR $\approx 0.045$). The MC/HSC sub-band boundary at dHash$=15$, by contrast, *is* plateau-like (local-to-median ratio $\approx 0.08$), consistent with high-dHash-tail saturation.
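The Wilson intervals quoted throughout can be reproduced with the standard score-interval formula; the sketch below uses the $70$-of-$299$ conditional count from this paragraph:

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half

# Conditional ICCR: 70 of 299 cosine-gated pairs also pass dHash <= 5.
lo, hi = wilson_ci(70, 299)
```

The same function recovers the per-comparison CIs in §IV-M when fed the corresponding hit counts.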
## G. Pixel-Identity as a Hard Positive Anchor; Inherited Inter-CPA Negative Anchor Reframed as Coincidence Rate
The only hard ground-truth subset in the corpus is pixel-identical signatures: those whose nearest same-CPA match is byte-identical after crop and normalisation. Independent hand-signing cannot produce byte-identical images, so these signatures are conservative-subset ground truth for the *replicated* class. On the Big-4 subset ($n = 262$ pixel-identical signatures), all three candidate classifiers — the inherited box rule, the K=3 hard label, and the reverse-anchor metric with a prevalence-calibrated cut — achieve $0\%$ positive-anchor miss rate (Wilson 95% upper bound $1.45\%$). We caution that this result is necessary but not sufficient: for the box rule it is close to tautological, because byte-identical neighbours have cosine $\approx 1$ and dHash $\approx 0$, well inside the rule's high-confidence region. The corresponding signature-level *negative* anchor evidence is developed in §III-L.1 above (v4 spike: cos$>0.95$ per-comparison ICCR $= 0.00060$, replicating v3.20.0's reported $0.0005$ under prior "FAR" terminology). We frame the per-comparison rate as a specificity proxy under the assumption that inter-CPA pairs constitute a clean negative anchor, and we document in §III-L.4 that this assumption is partially violated by within-firm cross-CPA template-like collision structures.
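A minimal sketch of the byte-identity anchor test, assuming the crop-and-normalise stage emits deterministic image bytes; the toy byte strings are stand-ins for real normalised crops:

```python
import hashlib

def byte_identity_key(normalised_image: bytes) -> str:
    """SHA-256 digest of a crop-and-normalised signature image; two
    signatures are treated as byte-identical iff their digests match
    (collision-safe at corpus scale)."""
    return hashlib.sha256(normalised_image).hexdigest()

# Toy stand-ins: a reused template crop versus a one-byte-different crop.
template = bytes(range(256)) * 4
variant = template[:-1] + b"\x00"
```

Grouping signatures by `byte_identity_key` and keeping groups of size $\geq 2$ within a CPA yields the hard positive-anchor subset; a single differing byte (as in `variant`) removes a pair from the anchor.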
## H. Limitations
Several limitations should be transparent. The first nine are v4.0-specific; the last five are inherited from v3.20.0 §V-G and still apply to the v4.0 pipeline.
*No signature-level ground truth; no true error rates reportable.* The corpus does not contain labelled hand-signed or replicated classes at the signature level. We therefore cannot report False Rejection Rate, sensitivity, recall, Equal Error Rate, ROC-AUC, precision, or positive predictive value against ground truth. All quantitative rates reported in §III-L are inter-CPA negative-anchor coincidence rates (ICCRs) under the assumption that inter-CPA pairs constitute a clean negative anchor; this is a specificity proxy, not a calibrated specificity (§III-M).
*Inter-CPA negative-anchor assumption is partially violated.* The cross-firm hit matrix of §III-L.4 shows that $98$–$100\%$ of inter-CPA collisions under the deployed rule originate from candidates within the source firm, consistent with firm-specific template, stamp, or document-production reuse. The inter-CPA-as-negative assumption is therefore not exactly satisfied — some inter-CPA pairs may share firm-level templates rather than being independent random matches. Our reported per-comparison ICCRs are best read as specificity-proxy rates under a partially-violated assumption, not as calibrated FARs.
*Scope.* The v4.0 primary analyses are scoped to the Big-4 sub-corpus. We did not perform the full per-signature pool-normalised ICCR analysis at the full $n = 686$ scope; the §IV-K full-dataset Spearman re-run shows the K=3 $+$ box-rule rank-convergence is preserved at $n = 686$ but does not validate the Big-4 operational ICCRs, the LOOO firm-fold structure, or the five-way operational classifier at the broader scope.
*Pixel-identity is a conservative subset.* Byte-identical pairs are the easiest replicated cases, and for the inherited box rule the positive-anchor miss rate against byte-identical pairs is close to tautological (byte-identical $\Rightarrow$ cosine $\approx 1$, dHash $\approx 0$, well inside the high-confidence box). A score that fails the pixel-identity check would be disqualified, but passing the check does not guarantee correct behaviour on the broader replicated population (e.g., re-stamped or noisy-template-variant signatures).
*Inherited rule components are not separately v4-validated.* The five-way classifier's moderate-confidence band (cos $> 0.95$ AND $5 < \text{dHash} \leq 15$), the style-consistency band ($\text{dHash} > 15$), and the document-level worst-case aggregation rule retain their v3.20.0 calibration and capture-rate evidence; v4.0's anchor-based ICCR calibration covers the binary high-confidence sub-rule (and its tightening alternatives such as dHash$\leq 3$), and the alert-rate sensitivity analysis (§III-L.5) characterises only the HC threshold. The MC and HSC sub-band boundaries are not separately re-validated by v4.0's diagnostic battery.
*Deployed-rate excess is not a presumed true-positive rate.* The $\sim 44$-pp per-document gap between the observed deployed alert rate (HC: $0.62$ on real same-CPA pools) and the inter-CPA proxy rate (HC: $0.18$) cannot be interpreted as a presumed true-positive rate without additional assumptions that §III-M shows are unsafe (consistent within-CPA signing can exceed inter-CPA similarity at the cosine axis; within-firm template sharing inflates the inter-CPA proxy baseline). The gap is best read as a same-CPA repeatability signal.
*A1 pair-detectability stipulation.* The per-signature detector requires at least one same-CPA pair to be near-identical when a CPA uses image replication. A1 is plausible for high-volume stamping or firm-level electronic signing but not guaranteed when a corpus contains only one observed replicated report for a CPA, multiple template variants used in parallel, or scan-stage noise that pushes a replicated pair outside the detection regime.
*K=3 hard-posterior membership is composition-sensitive.* The K=3 hard-posterior membership for any single firm varies by up to $12.8$ pp across LOOO folds. This is documented as a composition-sensitivity band rather than failure, but it means K=3 hard labels are not used as v4.0 operational classifier output; they are reported only as accountant-level descriptive characterisation.
*No partner-level mechanism attribution.* v4.0 reports population-level patterns; it does not perform partner-level mechanism attribution or report-level claims of intent. The signature-level outputs are signature-level quantities throughout. The within-firm cross-CPA collision concentration of §III-L.4 is consistent with template-like reuse but is not by itself diagnostic of deliberate sharing.
*Transferred ImageNet features (inherited from v3.20.0).* The ResNet-50 feature extractor uses pre-trained ImageNet weights without signature-domain fine-tuning. While our backbone-ablation study (§IV-L, inherited from v3.20.0 §IV-I) and prior literature support the effectiveness of transferred ImageNet features for signature comparison, a signature-domain fine-tuned feature extractor could improve discriminative performance.
*Red-stamp HSV preprocessing artifacts (inherited from v3.20.0).* The red stamp removal preprocessing uses simple HSV color-space filtering, which may introduce artifacts where handwritten strokes overlap with red seal impressions. Blended pixels are replaced with white, potentially creating small gaps in signature strokes that could reduce dHash similarity. This bias would push classifications toward false negatives rather than false positives.
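A numpy-only sketch of the HSV masking step and its whitening artefact, assuming OpenCV-style HSV ranges; the thresholds, function name, and toy pixels are illustrative, not the pipeline's actual values:

```python
import numpy as np

def whiten_red_stamp(hsv: np.ndarray, sat_min: int = 80, val_min: int = 80) -> np.ndarray:
    """Whiten saturated red-hued pixels in an OpenCV-style HSV image
    (H in [0, 179], S and V in [0, 255]). Red wraps around H = 0, so both
    the low and high hue bands are masked; masked pixels become white
    (S = 0, V = 255) — the artefact source discussed above, since ink
    pixels blended with the stamp can be whitened too."""
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    red = ((h <= 10) | (h >= 170)) & (s >= sat_min) & (v >= val_min)
    out = hsv.copy()
    out[red] = (0, 0, 255)
    return out

# 2x2 toy image: low-hue red stamp pixel, blue-ish ink pixel,
# high-hue red stamp pixel, and black ink (unsaturated, kept).
img = np.array([[[5, 200, 200], [120, 60, 40]],
                [[175, 150, 180], [0, 0, 0]]], dtype=np.uint8)
cleaned = whiten_red_stamp(img)
```

The saturation and value gates are what protect dark ink (low V, low S) from being whitened; stroke pixels blended with the stamp sit between the two regimes, which is where the false-negative-leaning bias originates.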
*Longitudinal scan / PDF / compression confounds (inherited from v3.20.0).* Scanning equipment, PDF generation software, and compression algorithms may have changed over the 2013–2023 study period, potentially affecting similarity measurements. While cosine similarity and dHash are designed to be robust to such variations, longitudinal confounds cannot be entirely excluded.
*Source-exemplar misattribution in max/min pair logic (inherited from v3.20.0).* The max-cosine / min-dHash detection logic treats both ends of a near-identical same-CPA pair as non-hand-signed. In the rare case where one of the two documents contains a genuinely hand-signed exemplar that was subsequently reused as a stamping or e-signature template, the pair correctly identifies image reuse but misattributes non-hand-signed status to the source exemplar. This affects at most one source document per template variant per CPA and is not expected to be common.
*Legal and regulatory interpretation (inherited from v3.20.0).* Whether non-hand-signing of a CPA's own stored signature constitutes a violation of signing requirements is a jurisdiction-specific legal question. Our technical analysis can inform such determinations but cannot resolve them.
---
# VI. Conclusion and Future Work
We present a fully automated pipeline for detecting non-hand-signed CPA signatures in Taiwan-listed financial audit reports and a multi-tool framework for characterising and disclosing its operational behaviour at the Big-4 sub-corpus scope. The pipeline processes raw PDFs through VLM-based page identification, YOLO-based signature detection, ResNet-50 feature extraction, and dual-descriptor (cosine + independent-minimum dHash) similarity computation. The operational output is an inherited Paper A five-way per-signature classifier with worst-case document-level aggregation (§III-L). Applied to 90,282 audit reports filed between 2013 and 2023, the pipeline extracts 182,328 signatures from 758 CPAs, with the Big-4 sub-corpus (437 CPAs at accountant level; 150,442–150,453 signatures at signature level) as the primary analytical population.
Our central methodological contributions are: (1) a composition decomposition (Scripts 39b–39e) that establishes the absence of a within-population bimodal antimode in the Big-4 descriptor distribution: the apparent multimodality dissolves under joint firm-mean centring and integer-tie jitter ($p_{\text{median}} = 0.35$), so distributional "natural-threshold" framings of the inherited operating points are not empirically supported; (2) an anchor-based inter-CPA coincidence-rate (ICCR) calibration at three units of analysis — per-comparison ($0.0006$ at cos$>0.95$; $0.0013$ at dHash$\leq 5$; $0.00014$ jointly), pool-normalised per-signature ($0.11$ for the deployed any-pair HC rule), and per-document ($0.34$ for the operational HC$+$MC alarm) — with explicit terminological replacement of "FAR" by "ICCR" given the unsupervised setting; (3) firm heterogeneity quantification: logistic regression with pool-size adjustment gives odds ratios $0.053$, $0.010$, $0.027$ for Firms B/C/D relative to Firm A reference, indicating a large multiplicative effect that pool-size differences do not explain; (4) cross-firm hit matrix evidence that $98$–$100\%$ of inter-CPA collisions under the deployed rule originate from candidates within the source firm, consistent with firm-specific template, stamp, or document-production reuse mechanisms; (5) K=3 mixture demoted from "three mechanism clusters" to a descriptive firm-compositional partition; (6) three feature-derived scores converging on the per-CPA descriptor-position ranking at Spearman $\rho \geq 0.879$, reported as internal consistency rather than external validation; (7) $0\%$ positive-anchor miss rate on 262 byte-identical Big-4 signatures with the conservative-subset caveat; and (8) a nine-tool unsupervised-validation collection (§III-M) that explicitly discloses each tool's untested assumption and positions the system as an anchor-calibrated screening framework with human-in-the-loop review, not as a validated forensic 
detector.
Future work falls in four directions. *First*, a small-scale human-rated validation set would enable direct ROC optimisation and provide signature-level ground truth that v4.0 fundamentally lacks; without such ground truth, no true error rates can be reported. *Second*, the within-firm collision concentration documented in §III-L.4 (98–100% same-firm partners) invites a separate study to distinguish deliberate template sharing from passive firm-level production artefacts (shared scanners, common form templates, identical report-generation infrastructure) — a question the inter-CPA-anchor analysis alone cannot resolve. *Third*, the descriptive Firm A versus Firms B/C/D contrast (per-document HC$+$MC alarm $0.62$ vs $0.09$–$0.16$) — together with v3.x's byte-level evidence of 145 pixel-identical signatures across $\sim 50$ distinct Firm A partners — invites a companion analysis examining whether such firm-level signing patterns correlate with established audit-quality measures. *Fourth*, generalisation to mid- and small-firm contexts requires extending the anchor-based ICCR framework to scopes where firm-level LOOO folds are not available; the §III-I.4 composition diagnostics already document that the absence of within-population bimodality is corpus-universal, so the v4.0 calibration approach in principle generalises, but a full extension with cluster-robust uncertainty quantification is left as future work.
---
## Notes for Phase 4 close-out
Items remaining for the Phase 4 close-out pass before §I, §II, §V, §VI prose can be moved into the manuscript master file:
1. **Abstract word count.** The current draft is 243–244 words (`wc -w` on the paragraph returns 243; other counters differ by one token); either count satisfies IEEE Access's $\leq 250$-word constraint with $\sim 6$ words of margin.
2. **§I contributions list (8 items).** v3.20.0's contribution list had 7 items; v4.0's has 8 to reflect the Big-4 scope, K=3 descriptive role, and three-score convergence as separate contributions. Confirm whether the journal style supports 8 contributions or whether items can be merged.
3. **§II Related Work LOOO citation.** A standard cross-validation citation for the LOOO addition is flagged "[add citation]" in the draft and needs to be filled with a specific reference (Geisser 1975 / Stone 1974 / a modern survey).
4. **§V-G Limitations.** The fourteen limitations are listed flat; the journal style may prefer them grouped (scope vs ground-truth vs methodology) — consider reorganisation at copy-edit time.
5. **§VI Future Work directions.** Four directions are listed; the third (audit-quality companion analysis) ties to the Paper B placeholder in the project memory and should be cross-checked for consistency with the planned Paper B framing.
6. **Internal draft note + this close-out checklist.** Strip before submission packaging, per the across-paper "internal — remove before submission" policy applied to §III v6 and §IV v3.2 draft notes.
# Section IV. Results — v4.0 Draft v3.3 (post codex rounds 21–34)
> **Draft note (2026-05-12, v3.2; internal — remove before submission).** This file replaces the §IV-A through §IV-H block of `paper/paper_a_results_v3.md` (v3.20.0) with the Big-4 reframed structure. Section IV expands from 8 sub-sections in v3.20.0 to 12 sub-sections in v4.0 (A through L) to mirror the §III-G..L lineage. **Table-numbering scheme**: the v4 manuscript uses Tables V through XVIII (plus Table XV-B for document-level worst-case counts) for the new v4 Big-4 results; inherited v3.x tables are cited only as "v3.20.0 Table N" with their original v3 number and are *not* renumbered into the v4 sequence. No v4 Table IV is printed; the inherited v3.20.0 Table IV (per-firm detection counts) remains a v3.x reference rather than a v4 table. **Anonymisation**: the Big-4 firms are pseudonymously labelled Firm A through Firm D throughout the manuscript body; real names are not printed in v4 tables or prose. The v3 → v3.1 → v3.2 revision history is: v3 (post round 23) made the table-numbering scheme and anonymisation policy decisions and applied 14 presentation fixes; v3.1 (post round 24) tightened the close-out checklist; v3.2 (post round 25) finalises this draft note. Empirical anchors trace to Scripts 32–42 on branch `paper-a-v4-big4`; the §III provenance table covers the methodology-side citations and §IV adds new tables for the v4.0-specific results.
## D. Big-4 Accountant-Level Distributional Characterisation
This section reports the empirical evidence for §III-I's distributional diagnostics at the Big-4 accountant level. All numbers below are direct re-statements from Scripts 32 / 34. The accountant-level dip-test rejection reported in Table V is, per §III-I.4 (Scripts 39b–39e), fully attributable to between-firm location shifts and integer mass-point artefacts rather than to within-population bimodality; the v4-new composition-decomposition diagnostics that establish this finding are tabulated in §IV-M below alongside the anchor-based ICCR calibration.
**Table V.** Hartigan dip-test results, accountant-level marginals (Big-4 primary; comparison scopes from Script 32).
This section reports the K=2 and K=3 2D Gaussian mixture fits to the Big-4 accountant-level distribution and the bootstrap stability of their marginal crossings.
**Table VII.** Big-4 K=2 mixture components (descriptive partition; not mechanism clusters per §III-J) and marginal-crossing bootstrap 95% confidence intervals.
| K=2 component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight |
|---|---|---|---|
| K=2-a (low-cos / high-dHash position) | 0.954 | 7.14 | 0.689 |
| K=2-b (high-cos / low-dHash position) | 0.983 | 2.41 | 0.311 |

Marginal crossings (point + bootstrap 95% CI, $n_{\text{boot}} = 500$):
$\text{BIC}(K{=}2) = -1108.45$ (Script 34).
**Table VIII.** Big-4 K=3 mixture components (descriptive firm-compositional partition per §III-J; not mechanism clusters).
| K=3 component | $\overline{\text{cos}}$ | $\overline{\text{dHash}}$ | weight | descriptive position |
|---|---|---|---|---|
| C1 | 0.9457 | 9.17 | 0.143 | low-cos / high-dHash corner |
| C2 | 0.9558 | 6.66 | 0.536 | central region |
| C3 | 0.9826 | 2.41 | 0.321 | high-cos / low-dHash corner |

$\text{BIC}(K{=}3) = -1111.93$, lower than $K{=}2$ by $3.48$ (mild support; not by itself decisive). The full-fit K=3 baseline above is reproduced in Scripts 35, 37, and 38 with identical hyperparameters; Script 37 additionally fits K=3 on each leave-one-firm-out training set (those fold-specific components differ from the full-fit baseline by design and are reported separately in §IV-G Table XIII). Operational use of the K=2 / K=3 fits is governed by §III-J and §III-L; §IV-G reports the LOOO reproducibility evidence that motivates reporting both fits descriptively.
We caution that for the Paper A box rule this result is close to tautological (byte-identical nearest-neighbour signatures have cosine $\approx 1$ and dHash $\approx 0$, well inside the rule's high-confidence region); v3.20.0 §V-F discusses this conservative-subset caveat at length and we retain that discussion. The reverse-anchor cut is chosen by *prevalence calibration* against the inherited box rule's overall replicated rate of $49.58\%$ across Big-4 signatures; this is a documented v4.0 limitation since no signature-level hand-signed ground truth exists to permit direct ROC optimisation.
## I. Inter-CPA Pair-Level Coincidence Rate (Big-4 spike + inherited corpus-wide)
The signature-level inter-CPA pair-level coincidence-rate analysis (reported in v3.20.0 §IV-F.1, Table X as "FAR") is inherited and extended in v4.0. v4.0 retroactively reframes the metric as **inter-CPA pair-level coincidence rate (ICCR)** rather than "False Acceptance Rate" because the corpus does not provide signature-level ground-truth negative labels; the inter-CPA negative-anchor assumption underpinning the metric is itself partially violated by within-firm cross-CPA template-like collision structures (§III-L.4). The v3.20.0 corpus-wide spike on $\sim 50{,}000$ inter-CPA pairs reported a per-comparison rate of $0.0005$ (Wilson 95% CI $[0.0003, 0.0007]$) at the cosine cut $0.95$.
v4.0 additionally reports the §III-L.1 Big-4-scope spike at higher sample size ($5 \times 10^5$ inter-CPA pairs; Script 40b), which replicates and extends the v3 result and adds the structural dimension (dHash) and joint-rule rates. The §III-L.1 numbers are referenced rather than duplicated here; the consolidated v4-new ICCR calibration appears in §IV-M Table XVI.
## J. Five-Way Per-Signature + Document-Level Classification Output
The feature-backbone ablation (v3.20.0 Table XVIII; backbone replacement of ResNet-50 with alternative ImageNet-pretrained backbones to verify that the §III-E embedding choice is not load-bearing) is inherited unchanged. v3.20.0 Table XVIII is cited by its original v3 number and is **not** the same table as the v4 Table XVIII (which reports the Big-4 vs full-dataset Spearman drift in §IV-K). v4.0 makes no scope-specific re-derivation of the ablation; the analysis is a methodological-stability check on the embedding stage and is corpus-wide rather than Big-4-restricted.
## M. v4-New Anchor-Based ICCR Calibration Results
This section consolidates the v4-new empirical results that support the §III-L anchor-based threshold calibration framework. Numbers below are direct re-statements from the spike scripts cited per row; the corresponding §III provenance table entries appear in §III's provenance table.
### M.1 Composition decomposition (Scripts 39b–39e)
**Table XIX.** Within-firm and between-firm decomposition of the Big-4 accountant-level dip-test rejection.
| Diagnostic | Scope | Statistic | Implication |
|---|---|---|---|
| Within-firm signature-level cosine dip | Big-4 (4 firms) | $p_{\text{cos}} \in \{0.176, 0.991, 0.551, 0.976\}$ | 0/4 firms reject; cosine within-firm unimodal |
| Within-firm signature-level cosine dip | non-Big-4 (10 firms $\geq 500$ sigs) | $p_{\text{cos}} \in [0.59, 0.99]$ | 0/10 firms reject; cosine within-firm unimodal |
| Within-firm jittered-dHash dip (5 seeds, median) | Big-4 (4 firms) | $p_{\text{med}} \in \{0.999, 0.996, 0.999, 0.9995\}$ | 0/4 firms reject after integer jitter; raw rejection was an integer-tie artefact |
| Big-4 pooled dHash: 2×2 factorial | firm-centred + jittered (5 seeds) | $p_{\text{med}} = 0.35$, 0/5 seeds reject | combined corrections eliminate rejection; multimodality is composition + integer artefact |
| Integer-histogram valley near $\text{dHash} \approx 5$ | within each Big-4 firm | none (0/4 firms) | no within-firm dHash antimode at the inherited HC cutoff |

(Source: Scripts 39b, 39c, 39d, 39e; bootstrap $n_{\text{boot}} = 2000$; jitter $\sim \mathrm{U}[-0.5, +0.5]$.)
### M.2 Anchor-based inter-CPA pair-level ICCR (Script 40b)
**Table XX.** Big-4 inter-CPA per-comparison ICCR sweep, $n = 5 \times 10^5$ pairs (Big-4 scope; v4 new).
| Threshold | Per-comparison ICCR | 95% Wilson CI |
|---|---|---|
| cos $> 0.945$ (v3.x published "natural threshold") | $0.00081$ | $[0.00073, 0.00089]$ |
| cos $> 0.95$ (inherited operating point) | $0.00060$ | $[0.00053, 0.00067]$ |
| cos $> 0.97$ | $0.00024$ | $[0.00020, 0.00029]$ |
| cos $> 0.98$ | $0.00009$ | $[0.00007, 0.00012]$ |
| dHash $\leq 5$ (inherited operating point) | $0.00129$ | $[0.00120, 0.00140]$ |
| dHash $\leq 4$ | $0.00050$ | $[0.00044, 0.00057]$ |
| dHash $\leq 3$ | $0.00019$ | $[0.00015, 0.00023]$ |
| Joint: cos $> 0.95$ AND dHash $\leq 5$ (any-pair semantics) | $0.00014$ | — |
| Joint: cos $> 0.95$ AND dHash $\leq 4$ (any-pair) | $0.00011$ | — |

Conditional ICCR (dHash $\leq 5$ given cos $> 0.95$) $= 0.234$ (Wilson 95% $[0.190, 0.285]$; $70$ of $299$ pairs).
|
||||
|
||||
The cos $> 0.95$ row replicates v3.20.0 §IV-F.1 Table X (v3 reported $0.0005$ under prior "FAR" terminology). The dHash row and joint row are v4 new.
|
||||
|
||||
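The Wilson intervals above can be re-derived from the raw counts. A minimal sketch using the one row whose counts are reported explicitly (70 of 299 pairs for the conditional ICCR); `wilson_ci` is an illustrative helper, not part of Script 40b:

```python
import math

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# Conditional ICCR(dHash <= 5 | cos > 0.95): 70 hits in 299 pairs
lo, hi = wilson_ci(70, 299)
print(f'{70 / 299:.3f} [{lo:.3f}, {hi:.3f}]')  # → 0.234 [0.190, 0.285]
```

The same helper reproduces the narrow intervals of the $n = 5 \times 10^5$ rows, where the $z^2/n$ correction is negligible.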
### M.3 Pool-normalised per-signature ICCR (Script 43)

**Table XXI.** Pool-normalised per-signature ICCR under the deployed any-pair HC rule (cos $> 0.95$ AND dHash $\leq 5$); $n_{\text{sig}} = 150{,}453$ (vector-complete Big-4); CPA-block bootstrap $n_{\text{boot}} = 1000$.

| Scope | Per-signature ICCR | Wilson 95% CI | CPA-bootstrap 95% CI |
|---|---|---|---|
| Big-4 pooled (any-pair, deployed) | $0.1102$ | $[0.1086, 0.1118]$ | $[0.0908, 0.1330]$ |
| Big-4 pooled (same-pair, stricter alternative) | $0.0827$ | $[0.0813, 0.0841]$ | $[0.0668, 0.1021]$ |
| Firm A (any-pair) | $0.2594$ | — | — |
| Firm B (any-pair) | $0.0147$ | — | — |
| Firm C (any-pair) | $0.0053$ | — | — |
| Firm D (any-pair) | $0.0110$ | — | — |
| Pool-size decile 1 (smallest pools) any-pair | $0.0249$ | — | — |
| Pool-size decile 10 (largest pools) any-pair | $0.1905$ | — | — |

Decile trend is broadly monotone in pool size with two minor reversals (decile 5 and decile 9 dip below their predecessors). The stricter operating point cos $> 0.95$ AND dHash $\leq 3$ (same-pair) gives per-signature ICCR $0.0449$.

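The CPA-bootstrap intervals in Table XXI come from resampling whole CPA blocks rather than individual signatures, so within-CPA correlation is preserved. A minimal sketch on toy per-CPA counts (the `toy` data are hypothetical, not the Script 43 inputs):

```python
import numpy as np

def cpa_block_bootstrap_ci(hits_by_cpa, n_boot=1000, seed=0):
    """Percentile 95% CI for a pooled per-signature rate, resampling
    CPAs (blocks) with replacement; hits_by_cpa holds one
    (n_hits, n_signatures) pair per CPA."""
    rng = np.random.default_rng(seed)
    blocks = np.asarray(hits_by_cpa, dtype=float)
    rates = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(blocks), len(blocks))
        sample = blocks[idx]
        # pooled rate of the resampled corpus, not a mean of per-CPA rates
        rates.append(sample[:, 0].sum() / sample[:, 1].sum())
    return np.percentile(rates, [2.5, 97.5])

toy = [(30, 200), (2, 150), (90, 400), (5, 120), (40, 260)]  # hypothetical
lo, hi = cpa_block_bootstrap_ci(toy)
```

Because heavy-hit CPAs dominate some resamples, this interval is wider than the Wilson interval on the same pooled counts, matching the pattern in Table XXI.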
### M.4 Document-level ICCR under three alarm definitions (Script 45)

**Table XXII.** Document-level inter-CPA ICCR by alarm definition; $n_{\text{docs}} = 75{,}233$.

| Alarm definition | Alarm set | Document-level ICCR | Wilson 95% CI |
|---|---|---|---|
| D1 | HC only | $0.1797$ | $[0.1770, 0.1825]$ |
| D2 (operational) | HC + MC | $0.3375$ | $[0.3342, 0.3409]$ |
| D3 | HC + MC + HSC | $0.3384$ | $[0.3351, 0.3418]$ |

Per-firm D2 document-level ICCR: Firm A $0.6201$ ($n = 30{,}226$); Firm B $0.1600$ ($n = 17{,}127$); Firm C $0.1635$ ($n = 19{,}501$); Firm D $0.0863$ ($n = 8{,}379$).

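Because the three alarm sets are nested (D1 $\subset$ D2 $\subset$ D3), the document-level ICCR can only grow as the set widens, which is why D3 adds almost nothing over D2. A minimal sketch on a hypothetical 100-document corpus (labels illustrative, not the Script 45 data):

```python
def doc_iccr(doc_labels, alarm_set):
    """Fraction of documents whose alarm label falls in alarm_set."""
    return sum(1 for lbl in doc_labels if lbl in alarm_set) / len(doc_labels)

# Hypothetical per-document strongest-alarm labels:
docs = ['HC'] * 18 + ['MC'] * 16 + ['HSC'] * 1 + ['none'] * 65
d1 = doc_iccr(docs, {'HC'})               # D1: HC only
d2 = doc_iccr(docs, {'HC', 'MC'})         # D2: HC + MC
d3 = doc_iccr(docs, {'HC', 'MC', 'HSC'})  # D3: HC + MC + HSC
assert d1 <= d2 <= d3                     # nesting forces monotonicity
```

The toy split mirrors the reported pattern: most of the jump comes from adding MC, while HSC contributes only marginally.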
### M.5 Firm heterogeneity logistic regression and cross-firm hit matrix (Script 44)

**Table XXIII.** Logistic regression of per-signature any-pair HC hit indicator on firm dummies and centred log pool size (Firm A reference).

| Term | Odds ratio (vs Firm A) | Direction |
|---|---|---|
| Firm B | $0.053$ | $\sim 19\times$ lower odds than Firm A |
| Firm C | $0.010$ | $\sim 100\times$ lower odds than Firm A |
| Firm D | $0.027$ | $\sim 37\times$ lower odds than Firm A |
| log(pool size, centred) | $4.01$ | $\sim 4\times$ higher odds per log unit pool size |

Per-decile per-firm rates (table not duplicated here; the Script 44 decile table is available in the supplementary report): within every pool-size decile, Firms B/C/D show rates of $0.0006$–$0.0358$ while Firm A ranges $0.0541$–$0.5958$. The firm gap survives within matched pool sizes.

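The Direction column of Table XXIII is just the reported odds ratios re-expressed as fold changes relative to Firm A; a minimal sketch (`fold_change` is an illustrative helper, not Script 44 code):

```python
def fold_change(odds_ratio):
    """Express an odds ratio as a '~Nx lower/higher odds' fold."""
    if odds_ratio < 1:
        return 1 / odds_ratio, 'lower'
    return odds_ratio, 'higher'

# Odds ratios as reported in Table XXIII (Firm A reference):
for term, or_ in [('Firm B', 0.053), ('Firm C', 0.010),
                  ('Firm D', 0.027), ('log pool size', 4.01)]:
    fold, direction = fold_change(or_)
    print(f'{term}: ~{fold:.0f}x {direction} odds')
# → ~19x / ~100x / ~37x lower for Firms B/C/D; ~4x higher per log unit
```

Note these are odds folds, not rate folds; with Firm A's high base rate the rate ratio is smaller than the odds ratio suggests.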
**Table XXIV.** Cross-firm hit matrix among Big-4 source signatures with any-pair HC hit; max-cosine partner firm (counts).

| Source firm | Firm A cand. | Firm B | Firm C | Firm D | non-Big-4 | n hits |
|---|---|---|---|---|---|---|
| Firm A | $14{,}447$ | $95$ | $44$ | $19$ | $17$ | $14{,}622$ |
| Firm B | $92$ | $371$ | $8$ | $4$ | $9$ | $484$ |
| Firm C | $16$ | $7$ | $149$ | $5$ | $1$ | $178$ |
| Firm D | $22$ | $2$ | $6$ | $106$ | $1$ | $137$ |

Same-pair joint hits (a single candidate satisfying both cos $> 0.95$ AND dHash $\leq 5$) are within-firm at rates $99.96\%$ / $97.7\%$ / $98.2\%$ / $97.0\%$ for Firms A/B/C/D respectively.

### M.6 Alert-rate sensitivity around inherited HC threshold (Script 46)

**Table XXV.** Local-gradient / median-gradient ratio at inherited thresholds (descriptive plateau diagnostic).

| Threshold | Local / median gradient ratio | Interpretation |
|---|---|---|
| cos $= 0.95$ (HC) | $\approx 25\times$ | locally sensitive (not plateau-stable) |
| dHash $= 5$ (HC) | $\approx 3.8\times$ | locally sensitive (not plateau-stable) |
| dHash $= 15$ (MC/HSC boundary) | $\approx 0.08$ | plateau-like (saturating tail) |

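The plateau diagnostic in Table XXV is the alert-rate gradient at the threshold divided by the median gradient over the sweep; values well above 1 flag local sensitivity. A minimal sketch on a synthetic sweep with a sharp shoulder at 0.95 (illustrative, not the Script 46 sweep):

```python
import numpy as np

def plateau_ratio(thresholds, alert_rates, at):
    """Local-gradient / median-gradient ratio; >> 1 means the alert
    rate is locally sensitive at `at`, << 1 means plateau-like."""
    g = np.abs(np.gradient(np.asarray(alert_rates, dtype=float),
                           np.asarray(thresholds, dtype=float)))
    local = g[int(np.argmin(np.abs(np.asarray(thresholds) - at)))]
    return float(local / np.median(g))

ths = np.linspace(0.90, 0.99, 10)
rates = 1 / (1 + np.exp(-(ths - 0.95) * 200))  # synthetic sharp shoulder
ratio = plateau_ratio(ths, rates, at=0.95)     # well above 1
```

A saturating tail like the dHash $= 15$ row would instead give a local gradient below the sweep median, pushing the ratio under 1.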
Big-4 observed deployed alert rate on actual same-CPA pools: per-signature HC $= 0.4958$; per-document HC $= 0.6228$. The deployed-rate excess over the inter-CPA proxy is $0.3856$ per-signature ($0.4958 - 0.1102$) and $0.4431$ per-document ($0.6228 - 0.1797$), i.e. roughly $39$ and $44$ percentage points; this excess is interpreted as a same-CPA repeatability signal under the §III-M caveats, not as a presumed true-positive rate.


---

## Phase 3 close-out checklist

@@ -0,0 +1,195 @@
#!/usr/bin/env python3
"""
Script 39b: Signature-Level Dip Test (multimodality at the signature cloud)
============================================================================
Phase 5 pre-emptive evidence. Script 34 / 36 already report Hartigan
dip tests on the 437 accountant-level (cos_mean, dh_mean) means and
both marginals reject unimodality at p < 5e-4. Reviewers may ask
whether the same multimodality is detectable at the signature level
itself (n = 150,442 Big-4 signatures) and whether the multimodality
is a within-firm or only a between-firm phenomenon.

This script supplies the missing dip evidence on the raw signature
cloud. It is a *diagnostic* in the same role as Scripts 34/36 dip
tests: it does not derive an operational threshold; it characterises
the marginal distributions of (cos, dh_indep) at the signature level.

Outputs:
    reports/v4_big4/signature_level_diptest/
        sig_diptest_results.json
        sig_diptest_report.md

Tests performed:
    A. Pooled Big-4 marginals (cos, dh_indep), n = 150,442
    B. Per-firm marginals (Firm A / B / C / D separately)
"""

import json
import sqlite3
import numpy as np
import diptest
from pathlib import Path
from datetime import datetime
from scipy import stats
from scipy.signal import find_peaks

DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
           'v4_big4/signature_level_diptest')
OUT.mkdir(parents=True, exist_ok=True)

BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
ALIAS = {'勤業眾信聯合': 'Firm A',
         '安侯建業聯合': 'Firm B',
         '資誠聯合': 'Firm C',
         '安永聯合': 'Firm D'}
N_BOOT = 2000


def load_big4_signatures():
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    cur.execute('''
        SELECT s.assigned_accountant, a.firm,
               s.max_similarity_to_same_accountant,
               CAST(s.min_dhash_independent AS REAL)
        FROM signatures s
        JOIN accountants a ON s.assigned_accountant = a.name
        WHERE s.assigned_accountant IS NOT NULL
          AND s.max_similarity_to_same_accountant IS NOT NULL
          AND s.min_dhash_independent IS NOT NULL
          AND a.firm IN (?, ?, ?, ?)
    ''', BIG4)
    rows = cur.fetchall()
    conn.close()
    return rows


def kde_dip(values, n_boot=N_BOOT):
    arr = np.asarray(values, dtype=float)
    arr = arr[np.isfinite(arr)]
    dip, pval = diptest.diptest(arr, boot_pval=True, n_boot=n_boot)
    kde = stats.gaussian_kde(arr, bw_method='silverman')
    xs = np.linspace(arr.min(), arr.max(), 2000)
    density = kde(xs)
    peaks, _ = find_peaks(density, prominence=density.max() * 0.02)
    antimodes = []
    for i in range(len(peaks) - 1):
        seg = density[peaks[i]:peaks[i + 1]]
        if not len(seg):
            continue
        local = peaks[i] + int(np.argmin(seg))
        antimodes.append(float(xs[local]))
    return {
        'n': int(len(arr)),
        'dip': float(dip),
        'dip_pvalue': float(pval),
        'unimodal_alpha05': bool(pval > 0.05),
        'n_modes': int(len(peaks)),
        'mode_locations': [float(xs[p]) for p in peaks],
        'antimodes': antimodes,
        'n_boot': int(n_boot),
    }


def _fmt_p(p):
    if p == 0.0:
        return '< 5e-4 (no bootstrap replicate exceeded observed dip)'
    return f'{p:.4g}'


def main():
    print('=' * 72)
    print('Script 39b: Signature-Level Dip Test')
    print('=' * 72)
    rows = load_big4_signatures()
    cos_all = np.array([r[2] for r in rows], dtype=float)
    dh_all = np.array([r[3] for r in rows], dtype=float)
    firms = np.array([ALIAS[r[1]] for r in rows])
    print(f'\nLoaded {len(rows):,} Big-4 signatures')
    for f in sorted(set(firms)):
        print(f'  {f}: {(firms == f).sum():,}')

    results = {
        'meta': {
            'script': '39b',
            'timestamp': datetime.now().isoformat(timespec='seconds'),
            'n_total': int(len(rows)),
            'n_boot': N_BOOT,
            'note': ('Signature-level Hartigan dip test on Big-4 '
                     '(cos, dh_indep) marginals; pooled and per-firm.'),
        },
        'pooled': {},
        'per_firm': {},
    }

    # A. Pooled
    print('\n[A] Pooled Big-4')
    for desc, arr in [('cos', cos_all), ('dh_indep', dh_all)]:
        r = kde_dip(arr)
        results['pooled'][desc] = r
        print(f'  {desc}: n={r["n"]:,}, dip={r["dip"]:.5f}, '
              f'p={_fmt_p(r["dip_pvalue"])}, n_modes={r["n_modes"]}')

    # B. Per-firm
    print('\n[B] Per-firm')
    for f in sorted(set(firms)):
        mask = firms == f
        results['per_firm'][f] = {}
        for desc, arr in [('cos', cos_all[mask]), ('dh_indep', dh_all[mask])]:
            r = kde_dip(arr)
            results['per_firm'][f][desc] = r
            print(f'  {f} {desc}: n={r["n"]:,}, dip={r["dip"]:.5f}, '
                  f'p={_fmt_p(r["dip_pvalue"])}, n_modes={r["n_modes"]}')

    json_path = OUT / 'sig_diptest_results.json'
    json_path.write_text(json.dumps(results, indent=2, ensure_ascii=False),
                         encoding='utf-8')
    print(f'\n[json] {json_path}')

    md = ['# Signature-Level Dip Test (Script 39b)',
          '',
          f'Generated: {results["meta"]["timestamp"]}',
          f'Bootstrap replicates: {N_BOOT}',
          '',
          '## A. Pooled Big-4 signature cloud',
          '',
          f'n = {results["meta"]["n_total"]:,} signatures',
          '',
          '| Marginal | dip | p (boot) | n_modes | unimodal @0.05 |',
          '|---|---|---|---|---|']
    for desc in ['cos', 'dh_indep']:
        r = results['pooled'][desc]
        md.append(f'| {desc} | {r["dip"]:.5f} | {_fmt_p(r["dip_pvalue"])} | '
                  f'{r["n_modes"]} | {r["unimodal_alpha05"]} |')

    md += ['', '## B. Per-firm signature-level dip tests', '',
           '| Firm | Marginal | n | dip | p (boot) | n_modes | unimodal @0.05 |',
           '|---|---|---|---|---|---|---|']
    for f in sorted(results['per_firm']):
        for desc in ['cos', 'dh_indep']:
            r = results['per_firm'][f][desc]
            md.append(f'| {f} | {desc} | {r["n"]:,} | {r["dip"]:.5f} | '
                      f'{_fmt_p(r["dip_pvalue"])} | {r["n_modes"]} | '
                      f'{r["unimodal_alpha05"]} |')
    md += ['',
           '## Reading guide',
           '',
           ('A unimodality rejection at the signature level confirms '
            'multimodal structure independent of accountant-level '
            'aggregation. A within-firm rejection further indicates the '
            'multimodality is not solely a between-firm artefact. A '
            'within-firm non-rejection (e.g., Firm A) is consistent with '
            'that firm being concentrated in a single mechanism corner.'),
           '',
           ('All thresholds and operational classifiers remain those of '
            'v3.x §III-K and v4.0 §III-J; this script supplies diagnostic '
            'evidence only.'),
           '']
    md_path = OUT / 'sig_diptest_report.md'
    md_path.write_text('\n'.join(md), encoding='utf-8')
    print(f'[md ] {md_path}')


if __name__ == '__main__':
    main()

@@ -0,0 +1,214 @@
#!/usr/bin/env python3
"""
Script 39c: Mid/Small-Firm Signature-Level Dip Test
====================================================
Companion to Script 39b. 39b showed every Big-4 firm rejects
unimodality on the dHash signature marginal (p < 5e-4 in each
of A/B/C/D) while every Big-4 firm fails to reject unimodality
on the cosine marginal. This script asks the same questions of
the mid/small-firm population (non-Big-4):

1. Does the pooled mid/small-firm signature cloud show the same
   dHash multimodality?
2. Within individual mid/small firms (those with enough
   signatures to support the test), does the dHash multimodality
   hold firm-internally as it does in Big-4?

If yes, the dHash signature-level multimodality is corpus-universal
and the Big-4 scope restriction of v4.0 is not necessary on dHash
grounds (cf §III-G item 2 which currently rests on Big-4-level
multimodality). The cosine axis is reported alongside for
completeness, but no v4.0 claim turns on cosine multimodality
outside Big-4.

Outputs:
    reports/v4_big4/midsmall_signature_diptest/
        midsmall_diptest_results.json
        midsmall_diptest_report.md
"""

import json
import sqlite3
import numpy as np
import diptest
from pathlib import Path
from datetime import datetime
from scipy import stats
from scipy.signal import find_peaks

DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
           'v4_big4/midsmall_signature_diptest')
OUT.mkdir(parents=True, exist_ok=True)

BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
N_BOOT = 2000
SINGLE_FIRM_MIN_SIG = 500  # minimum signature count to run a per-firm dip test


def load_non_big4_signatures():
    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    cur.execute('''
        SELECT a.firm,
               s.max_similarity_to_same_accountant,
               CAST(s.min_dhash_independent AS REAL)
        FROM signatures s
        JOIN accountants a ON s.assigned_accountant = a.name
        WHERE s.assigned_accountant IS NOT NULL
          AND s.max_similarity_to_same_accountant IS NOT NULL
          AND s.min_dhash_independent IS NOT NULL
          AND a.firm IS NOT NULL
          AND a.firm NOT IN (?, ?, ?, ?)
    ''', BIG4)
    rows = cur.fetchall()
    conn.close()
    return rows


def kde_dip(values, n_boot=N_BOOT):
    arr = np.asarray(values, dtype=float)
    arr = arr[np.isfinite(arr)]
    if len(arr) < 10:
        return {'n': int(len(arr)), 'skipped': 'too few points'}
    dip, pval = diptest.diptest(arr, boot_pval=True, n_boot=n_boot)
    kde = stats.gaussian_kde(arr, bw_method='silverman')
    xs = np.linspace(arr.min(), arr.max(), 2000)
    density = kde(xs)
    peaks, _ = find_peaks(density, prominence=density.max() * 0.02)
    antimodes = []
    for i in range(len(peaks) - 1):
        seg = density[peaks[i]:peaks[i + 1]]
        if not len(seg):
            continue
        local = peaks[i] + int(np.argmin(seg))
        antimodes.append(float(xs[local]))
    return {
        'n': int(len(arr)),
        'dip': float(dip),
        'dip_pvalue': float(pval),
        'unimodal_alpha05': bool(pval > 0.05),
        'n_modes': int(len(peaks)),
        'mode_locations': [float(xs[p]) for p in peaks],
        'antimodes': antimodes,
        'n_boot': int(n_boot),
    }


def _fmt_p(p):
    if p == 0.0:
        return '< 5e-4'
    return f'{p:.4g}'


def main():
    print('=' * 72)
    print('Script 39c: Mid/Small-Firm Signature-Level Dip Test')
    print('=' * 72)
    rows = load_non_big4_signatures()
    cos_all = np.array([r[1] for r in rows], dtype=float)
    dh_all = np.array([r[2] for r in rows], dtype=float)
    firms = np.array([r[0] for r in rows])
    n_total = len(rows)
    print(f'\nLoaded {n_total:,} non-Big-4 signatures across '
          f'{len(set(firms))} firms')

    # Firm size table
    firm_counts = {}
    for f in firms:
        firm_counts[f] = firm_counts.get(f, 0) + 1
    top = sorted(firm_counts.items(), key=lambda x: -x[1])
    print('\nTop firms by signature count:')
    for f, n in top[:10]:
        print(f'  {f}: {n:,}')

    results = {
        'meta': {
            'script': '39c',
            'timestamp': datetime.now().isoformat(timespec='seconds'),
            'n_total': int(n_total),
            'n_firms': int(len(firm_counts)),
            'n_boot': N_BOOT,
            'single_firm_min_sig': SINGLE_FIRM_MIN_SIG,
        },
        'pooled': {},
        'per_firm_eligible': {},
        'firm_counts': dict(firm_counts),
    }

    # A. Pooled non-Big-4
    print('\n[A] Pooled non-Big-4')
    for desc, arr in [('cos', cos_all), ('dh_indep', dh_all)]:
        r = kde_dip(arr)
        results['pooled'][desc] = r
        print(f'  {desc}: n={r["n"]:,}, dip={r["dip"]:.5f}, '
              f'p={_fmt_p(r["dip_pvalue"])}, n_modes={r["n_modes"]}')

    # B. Per-firm (only firms with >= SINGLE_FIRM_MIN_SIG signatures)
    eligible = [f for f, n in firm_counts.items() if n >= SINGLE_FIRM_MIN_SIG]
    print(f'\n[B] Per-firm dip test '
          f'(firms with >= {SINGLE_FIRM_MIN_SIG} signatures: {len(eligible)})')
    for f in sorted(eligible, key=lambda x: -firm_counts[x]):
        mask = firms == f
        results['per_firm_eligible'][f] = {'n': int(mask.sum())}
        for desc, arr in [('cos', cos_all[mask]), ('dh_indep', dh_all[mask])]:
            r = kde_dip(arr)
            results['per_firm_eligible'][f][desc] = r
            print(f'  {f[:20]:<22s} {desc}: n={r["n"]:,}, dip={r["dip"]:.5f}, '
                  f'p={_fmt_p(r["dip_pvalue"])}, n_modes={r["n_modes"]}')

    json_path = OUT / 'midsmall_diptest_results.json'
    json_path.write_text(json.dumps(results, indent=2, ensure_ascii=False),
                         encoding='utf-8')
    print(f'\n[json] {json_path}')

    md = ['# Mid/Small-Firm Signature-Level Dip Test (Script 39c)',
          '',
          f'Generated: {results["meta"]["timestamp"]}',
          f'Bootstrap replicates: {N_BOOT}',
          '',
          '## A. Pooled non-Big-4 signature cloud',
          '',
          f'n = {n_total:,} signatures across '
          f'{results["meta"]["n_firms"]} firms',
          '',
          '| Marginal | dip | p (boot) | n_modes | unimodal @0.05 |',
          '|---|---|---|---|---|']
    for desc in ['cos', 'dh_indep']:
        r = results['pooled'][desc]
        md.append(f'| {desc} | {r["dip"]:.5f} | {_fmt_p(r["dip_pvalue"])} | '
                  f'{r["n_modes"]} | {r["unimodal_alpha05"]} |')

    md += ['', f'## B. Single mid/small firms (>= {SINGLE_FIRM_MIN_SIG} '
           f'signatures), {len(eligible)} qualify', '',
           '| Firm | Marginal | n | dip | p (boot) | n_modes | unimodal @0.05 |',
           '|---|---|---|---|---|---|---|']
    for f in sorted(eligible, key=lambda x: -firm_counts[x]):
        for desc in ['cos', 'dh_indep']:
            r = results['per_firm_eligible'][f][desc]
            md.append(f'| {f[:20]} | {desc} | {r["n"]:,} | {r["dip"]:.5f} | '
                      f'{_fmt_p(r["dip_pvalue"])} | {r["n_modes"]} | '
                      f'{r["unimodal_alpha05"]} |')

    md += ['',
           '## Reading guide',
           '',
           ('If the pooled-non-Big-4 dHash marginal rejects unimodality '
            'AND the qualifying individual mid/small firms also reject, '
            'the dHash within-firm replication regime structure is '
            'corpus-universal and not Big-4-specific. In that case the '
            'Big-4 scope of v4.0 is justified on cosine-axis grounds '
            '(Firm-A composition; §III-G item 1) and accountant-level '
            'LOOO reproducibility (§III-G item 3), but not on dHash '
            'multimodality grounds (§III-G item 2 should be re-scoped or '
            'qualified). If the per-firm dHash tests instead fail to '
            'reject inside mid/small firms, the dHash multimodality is '
            'Big-4-specific and §III-G item 2 holds as stated.'),
           '']
    md_path = OUT / 'midsmall_diptest_report.md'
    md_path.write_text('\n'.join(md), encoding='utf-8')
    print(f'[md ] {md_path}')


if __name__ == '__main__':
    main()

@@ -0,0 +1,446 @@
#!/usr/bin/env python3
"""
Script 39d: dHash Discrete-Value Robustness Diagnostics
========================================================
Codex (gpt-5.5 xhigh) attack on Script 39b/39c findings revealed that
the within-firm dHash dip-test rejections are driven by integer mass
points (dHash takes integer values 0..64). A uniform jitter of
[-0.5, +0.5] eliminates dip rejection in every firm tested. This
script consolidates that finding into a permanent diagnostic and adds:

1. Raw vs jittered dip with multi-seed robustness (5 seeds)
2. Integer-histogram valley analysis: locate local minima between
   adjacent peaks in the binned integer distribution; report whether
   any valley centers near dh = 5
3. Firm-residualized dip on dHash (analog of cosine firm-mean
   centering that confirmed the cosine reframe)
4. Pairwise pair-coincidence: does the same same-CPA pair achieve
   both max cosine and min dHash, or are the two descriptors
   attached to different pairs? Foundation for "is (cos, dh) a
   joint signature regime descriptor or two parallel descriptors"

This script does not derive operational thresholds; it characterises
whether the v4.0 K=3 mixture and v3.x cos>0.95 AND dh<=5 rule are
robustly supported once integer-discreteness artifacts are removed.

Outputs:
    reports/v4_big4/dhash_discrete_robustness/
        dhash_discrete_results.json
        dhash_discrete_report.md
"""

import json
import sqlite3
import numpy as np
import diptest
from pathlib import Path
from datetime import datetime

DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
           'v4_big4/dhash_discrete_robustness')
OUT.mkdir(parents=True, exist_ok=True)

BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
ALIAS = {'勤業眾信聯合': 'Firm A',
         '安侯建業聯合': 'Firm B',
         '資誠聯合': 'Firm C',
         '安永聯合': 'Firm D'}
N_BOOT = 2000
JITTER_SEEDS = [42, 43, 44, 45, 46]
SINGLE_FIRM_MIN_SIG = 500


def load_signatures():
    conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
    cur = conn.cursor()
    cur.execute('''
        SELECT a.firm, s.assigned_accountant,
               s.max_similarity_to_same_accountant,
               CAST(s.min_dhash_independent AS REAL)
        FROM signatures s
        JOIN accountants a ON s.assigned_accountant = a.name
        WHERE s.assigned_accountant IS NOT NULL
          AND s.max_similarity_to_same_accountant IS NOT NULL
          AND s.min_dhash_independent IS NOT NULL
          AND a.firm IS NOT NULL
    ''')
    rows = cur.fetchall()
    conn.close()
    return rows


def dip(values, n_boot=N_BOOT):
    arr = np.asarray(values, dtype=float)
    arr = arr[np.isfinite(arr)]
    d, p = diptest.diptest(arr, boot_pval=True, n_boot=n_boot)
    return float(d), float(p)


def multi_seed_jitter_dip(values, seeds=JITTER_SEEDS, n_boot=N_BOOT):
    """Compute dip stat + p-value across seeds; return distribution."""
    arr = np.asarray(values, dtype=float)
    arr = arr[np.isfinite(arr)]
    stats = []
    for seed in seeds:
        rng = np.random.default_rng(seed)
        j = arr + rng.uniform(-0.5, 0.5, len(arr))
        d, p = diptest.diptest(j, boot_pval=True, n_boot=n_boot)
        stats.append({'seed': seed, 'dip': float(d), 'p': float(p)})
    return {
        'n_seeds': len(seeds),
        'p_min': min(s['p'] for s in stats),
        'p_max': max(s['p'] for s in stats),
        'p_median': float(np.median([s['p'] for s in stats])),
        'dip_min': min(s['dip'] for s in stats),
        'dip_max': max(s['dip'] for s in stats),
        'reject_at_05_count': int(sum(1 for s in stats if s['p'] <= 0.05)),
        'per_seed': stats,
    }


def integer_histogram_valleys(values, max_bin=20):
    """For integer-valued data, locate local minima in the count
    histogram on bins 0..max_bin. Returns valley positions and depths
    relative to flanking peaks."""
    arr = np.asarray(values, dtype=float)
    arr = arr[np.isfinite(arr)]
    bins = np.arange(0, max_bin + 2)  # 0, 1, ..., max_bin+1
    counts, edges = np.histogram(arr, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    valleys = []
    for i in range(1, len(counts) - 1):
        if counts[i] < counts[i - 1] and counts[i] < counts[i + 1]:
            left_peak = counts[i - 1]
            right_peak = counts[i + 1]
            min_peak = min(left_peak, right_peak)
            depth_rel = (min_peak - counts[i]) / min_peak if min_peak else 0
            valleys.append({
                'bin_center': float(centers[i]),
                'count': int(counts[i]),
                'left_peak_bin': int(centers[i - 1]),
                'left_peak_count': int(left_peak),
                'right_peak_bin': int(centers[i + 1]),
                'right_peak_count': int(right_peak),
                'depth_rel': float(depth_rel),
            })
    return {
        'histogram_bins_0_to_max': counts[:max_bin + 1].tolist(),
        'valleys': valleys,
        'note': ('valleys are bins where count < both neighbours; '
                 'depth_rel = (min(neighbour) - bin) / min(neighbour). '
                 'A genuine antimode would have a deep, stable valley '
                 'with depth_rel > 0.1.'),
    }


def firm_residualized(values, firm_labels):
    """Return values with firm means subtracted (centered to grand mean
    over firms). Used to test whether residual within-firm structure
    rejects unimodality."""
    arr = np.asarray(values, dtype=float)
    firms = np.asarray(firm_labels)
    out = arr.copy()
    grand = float(np.mean(arr))
    for f in np.unique(firms):
        m = firms == f
        out[m] = arr[m] - float(np.mean(arr[m])) + grand
    return out


def pair_coincidence_rate():
    """Fraction of signatures whose max-cosine partner equals the
    min-dHash partner within the same-CPA cross-year pool."""
    conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
    cur = conn.cursor()
    cur.execute('''
        SELECT COUNT(*) AS n_total,
               SUM(CASE WHEN max_cosine_pair_id IS NOT NULL
                         AND min_dhash_pair_id IS NOT NULL
                         AND max_cosine_pair_id = min_dhash_pair_id
                        THEN 1 ELSE 0 END) AS n_same_pair,
               SUM(CASE WHEN max_cosine_pair_id IS NOT NULL
                         AND min_dhash_pair_id IS NOT NULL
                         AND max_cosine_pair_id != min_dhash_pair_id
                        THEN 1 ELSE 0 END) AS n_diff_pair,
               SUM(CASE WHEN max_cosine_pair_id IS NULL
                         OR min_dhash_pair_id IS NULL
                        THEN 1 ELSE 0 END) AS n_null
        FROM signatures
    ''')
    row = cur.fetchone()
    conn.close()
    n_total, n_same, n_diff, n_null = row
    n_with_both = (n_same or 0) + (n_diff or 0)
    return {
        'n_total': int(n_total or 0),
        'n_with_both_pair_ids': int(n_with_both),
        'n_same_pair': int(n_same or 0),
        'n_diff_pair': int(n_diff or 0),
        'n_null': int(n_null or 0),
        'same_pair_rate': (float(n_same) / n_with_both
                           if n_with_both else None),
        'note': ('rate computed over signatures where both '
                 'max_cosine_pair_id and min_dhash_pair_id are present'),
    }


def _fmt_p(p):
    return '< 5e-4' if p == 0.0 else f'{p:.4g}'


def main():
    print('=' * 72)
    print('Script 39d: dHash Discrete-Value Robustness Diagnostics')
    print('=' * 72)
    rows = load_signatures()
    firms_raw = np.array([r[0] for r in rows])
    cos = np.array([r[2] for r in rows], dtype=float)
    dh = np.array([r[3] for r in rows], dtype=float)
    is_big4 = np.isin(firms_raw, BIG4)
    n = len(rows)
    print(f'\nLoaded {n:,} signatures; Big-4 {is_big4.sum():,}, '
          f'non-Big-4 {(~is_big4).sum():,}')

    results = {
        'meta': {
            'script': '39d',
            'timestamp': datetime.now().isoformat(timespec='seconds'),
            'n_total_signatures': int(n),
            'n_big4': int(is_big4.sum()),
            'n_non_big4': int((~is_big4).sum()),
            'n_boot': N_BOOT,
            'jitter_seeds': JITTER_SEEDS,
            'note': ('Diagnostic for dHash integer-mass-point artifact '
                     'in dip test; codex round-29 attack on Script 39b/c'),
        },
    }

    # ---- A. Raw vs multi-seed jittered dip ----
    print('\n[A] Raw vs jittered dip (5 seeds, n_boot=2000)')
    panels = {}
    # Big-4 pooled
    print('  Big-4 pooled:')
    raw_d, raw_p = dip(dh[is_big4])
    j = multi_seed_jitter_dip(dh[is_big4])
    panels['big4_pooled'] = {
        'n': int(is_big4.sum()),
        'raw': {'dip': raw_d, 'p': raw_p},
        'jittered': j,
    }
    print(f'    raw:    dip={raw_d:.5f}, p={_fmt_p(raw_p)}')
    print(f'    jitter: p_median={j["p_median"]:.4g}, '
          f'p_range=[{j["p_min"]:.4g}, {j["p_max"]:.4g}], '
          f'reject@.05 in {j["reject_at_05_count"]}/5 seeds')
    # Each Big-4 firm
    for f in BIG4:
        mask = firms_raw == f
        if mask.sum() == 0:
            continue
        raw_d, raw_p = dip(dh[mask])
        j = multi_seed_jitter_dip(dh[mask])
        panels[ALIAS[f]] = {
            'n': int(mask.sum()),
            'raw': {'dip': raw_d, 'p': raw_p},
            'jittered': j,
        }
        print(f'  {ALIAS[f]} (n={mask.sum():,}):')
        print(f'    raw:    dip={raw_d:.5f}, p={_fmt_p(raw_p)}')
        print(f'    jitter: p_median={j["p_median"]:.4g}, '
              f'reject@.05 in {j["reject_at_05_count"]}/5 seeds')
    # Non-Big-4 pooled
    print('  Non-Big-4 pooled:')
    raw_d, raw_p = dip(dh[~is_big4])
    j = multi_seed_jitter_dip(dh[~is_big4])
    panels['non_big4_pooled'] = {
        'n': int((~is_big4).sum()),
        'raw': {'dip': raw_d, 'p': raw_p},
        'jittered': j,
    }
    print(f'    raw:    dip={raw_d:.5f}, p={_fmt_p(raw_p)}')
    print(f'    jitter: p_median={j["p_median"]:.4g}, '
          f'reject@.05 in {j["reject_at_05_count"]}/5 seeds')
    results['raw_vs_jittered_dip'] = panels

    # ---- B. Integer-histogram valley analysis ----
    print('\n[B] Integer-histogram valley analysis (bins 0..20)')
    valleys = {}
    valleys['big4_pooled'] = integer_histogram_valleys(dh[is_big4])
    print(f'  Big-4 pooled: {len(valleys["big4_pooled"]["valleys"])} valleys')
    for v in valleys['big4_pooled']['valleys']:
        print(f'    bin {v["bin_center"]:.1f}: count={v["count"]}, '
              f'depth_rel={v["depth_rel"]:.3f}')
    for f in BIG4:
        mask = firms_raw == f
        if mask.sum() == 0:
            continue
        valleys[ALIAS[f]] = integer_histogram_valleys(dh[mask])
        print(f'  {ALIAS[f]}: '
              f'{len(valleys[ALIAS[f]]["valleys"])} valleys')
        for v in valleys[ALIAS[f]]['valleys']:
            print(f'    bin {v["bin_center"]:.1f}: count={v["count"]}, '
                  f'depth_rel={v["depth_rel"]:.3f}')
    valleys['non_big4_pooled'] = integer_histogram_valleys(dh[~is_big4])
    print(f'  Non-Big-4 pooled: '
          f'{len(valleys["non_big4_pooled"]["valleys"])} valleys')
    for v in valleys['non_big4_pooled']['valleys']:
        print(f'    bin {v["bin_center"]:.1f}: count={v["count"]}, '
              f'depth_rel={v["depth_rel"]:.3f}')
    results['integer_histogram_valleys'] = valleys

    # ---- C. Firm-residualized dip on dHash, signature level ----
    print('\n[C] Firm-residualized dHash dip (signature level)')
    firm_labels = np.array([
        ALIAS[f] if f in ALIAS else f'M:{f}'
        for f in firms_raw
    ])
    # Big-4 only residualized over A/B/C/D
    dh_resid_big4 = firm_residualized(dh[is_big4], firm_labels[is_big4])
    raw_d, raw_p = dip(dh[is_big4])
    res_d, res_p = dip(dh_resid_big4)
    print(f'  Big-4 raw:          dip={raw_d:.5f}, p={_fmt_p(raw_p)}')
    print(f'  Big-4 residualized: dip={res_d:.5f}, p={_fmt_p(res_p)}')
    # Also non-Big-4 residualized over their firms
    dh_resid_nbig4 = firm_residualized(dh[~is_big4], firm_labels[~is_big4])
    raw_d_n, raw_p_n = dip(dh[~is_big4])
    res_d_n, res_p_n = dip(dh_resid_nbig4)
    print(f'  Non-Big-4 raw:          dip={raw_d_n:.5f}, p={_fmt_p(raw_p_n)}')
    print(f'  Non-Big-4 residualized: dip={res_d_n:.5f}, p={_fmt_p(res_p_n)}')
    results['firm_residualized_dh_dip'] = {
        'big4': {
            'raw': {'dip': raw_d, 'p': raw_p},
            'firm_residualized': {'dip': res_d, 'p': res_p},
        },
        'non_big4': {
            'raw': {'dip': raw_d_n, 'p': raw_p_n},
            'firm_residualized': {'dip': res_d_n, 'p': res_p_n},
        },
        'note': ('Residualization subtracts each firm mean dh and adds '
                 'back the grand mean. If residual dip rejects, there is '
                 'genuine within-firm dh multimodality independent of '
                 'between-firm mean shifts. If residual fails to reject, '
                 'all dh "multimodality" was between-firm composition.'),
    }

    # ---- D. Pair-coincidence rate ----
    print('\n[D] Pair-coincidence rate (max-cos pair vs min-dh pair)')
|
||||
try:
|
||||
pc = pair_coincidence_rate()
|
||||
if pc['same_pair_rate'] is not None:
|
||||
print(f' n_with_both: {pc["n_with_both_pair_ids"]:,}, '
|
||||
f'same-pair rate: {pc["same_pair_rate"]:.4f}')
|
||||
else:
|
||||
print(' Pair IDs not stored in signatures table (skipped)')
|
||||
results['pair_coincidence'] = pc
|
||||
except sqlite3.OperationalError as e:
|
||||
print(f' SQL error (pair_id columns may not exist): {e}')
|
||||
results['pair_coincidence'] = {
|
||||
'error': str(e),
|
||||
'note': ('signatures table lacks max_cosine_pair_id / '
|
||||
'min_dhash_pair_id columns; analysis skipped'),
|
||||
}
|
||||
|
||||
json_path = OUT / 'dhash_discrete_results.json'
|
||||
json_path.write_text(json.dumps(results, indent=2, ensure_ascii=False),
|
||||
encoding='utf-8')
|
||||
print(f'\n[json] {json_path}')
|
||||
|
||||
# ---- Report markdown ----
|
||||
md = ['# dHash Discrete-Value Robustness Diagnostics (Script 39d)',
|
||||
'', f'Generated: {results["meta"]["timestamp"]}',
|
||||
f'Bootstrap replicates: {N_BOOT}; jitter seeds: {JITTER_SEEDS}',
|
||||
'',
|
||||
'## A. Raw vs jittered dHash dip (signature level)',
|
||||
'',
|
||||
('dHash is integer-valued in [0, 64]. A raw dip test on '
|
||||
'integer mass points may reject unimodality due to discrete '
|
||||
'spikes rather than a continuous bimodal density. We add '
|
||||
'uniform jitter in [-0.5, +0.5] over 5 seeds and re-test.'),
|
||||
'',
|
||||
'| Scope | n | raw dip | raw p | jitter p median | jitter reject@.05 / 5 seeds |',
|
||||
'|---|---|---|---|---|---|']
|
||||
for key, label in [('big4_pooled', 'Big-4 pooled')] + \
|
||||
[(ALIAS[f], ALIAS[f]) for f in BIG4] + \
|
||||
[('non_big4_pooled', 'Non-Big-4 pooled')]:
|
||||
if key in panels:
|
||||
p = panels[key]
|
||||
md.append(f'| {label} | {p["n"]:,} | '
|
||||
f'{p["raw"]["dip"]:.5f} | '
|
||||
f'{_fmt_p(p["raw"]["p"])} | '
|
||||
f'{p["jittered"]["p_median"]:.4g} | '
|
||||
f'{p["jittered"]["reject_at_05_count"]}/5 |')
|
||||
md += ['',
|
||||
'**Interpretation.** If jittered dip ceases to reject in all '
|
||||
'panels, the raw-data rejection was driven by integer ties '
|
||||
'rather than a continuous bimodal density. Codex round-29 '
|
||||
'observed this pattern; this script confirms with multi-seed '
|
||||
'robustness.',
|
||||
'',
|
||||
'## B. Integer-histogram valley locations (bins 0..20)',
|
||||
'',
|
||||
('For each scope, list bins where count is strictly less '
|
||||
'than both neighbours, with relative depth '
|
||||
'(min(neighbour) - bin) / min(neighbour). A genuine '
|
||||
'antimode would show a deep, stable valley; integer-noise '
|
||||
'valleys are shallow and inconsistent across firms.'),
|
||||
'']
|
||||
for key, label in [('big4_pooled', 'Big-4 pooled')] + \
|
||||
[(ALIAS[f], ALIAS[f]) for f in BIG4] + \
|
||||
[('non_big4_pooled', 'Non-Big-4 pooled')]:
|
||||
if key in valleys:
|
||||
v_list = valleys[key]['valleys']
|
||||
if not v_list:
|
||||
md.append(f'- **{label}**: no integer-histogram valleys '
|
||||
f'in 0..20')
|
||||
else:
|
||||
desc = ', '.join(
|
||||
f'dh={v["bin_center"]:.0f} (depth_rel={v["depth_rel"]:.3f})'
|
||||
for v in v_list)
|
||||
md.append(f'- **{label}**: {desc}')
|
||||
md += ['',
|
||||
'## C. Firm-residualized dHash dip',
|
||||
'',
|
||||
('Subtract each firm mean dHash; add back grand mean. If '
|
||||
'residual rejects, within-firm multimodality is genuine. '
|
||||
'If residual fails to reject, all dh "multimodality" was '
|
||||
'between-firm composition.'),
|
||||
'',
|
||||
'| Scope | raw dip | raw p | residualized dip | residualized p |',
|
||||
'|---|---|---|---|---|']
|
||||
fr = results['firm_residualized_dh_dip']
|
||||
md += [f'| Big-4 | {fr["big4"]["raw"]["dip"]:.5f} | '
|
||||
f'{_fmt_p(fr["big4"]["raw"]["p"])} | '
|
||||
f'{fr["big4"]["firm_residualized"]["dip"]:.5f} | '
|
||||
f'{_fmt_p(fr["big4"]["firm_residualized"]["p"])} |',
|
||||
f'| Non-Big-4 | {fr["non_big4"]["raw"]["dip"]:.5f} | '
|
||||
f'{_fmt_p(fr["non_big4"]["raw"]["p"])} | '
|
||||
f'{fr["non_big4"]["firm_residualized"]["dip"]:.5f} | '
|
||||
f'{_fmt_p(fr["non_big4"]["firm_residualized"]["p"])} |']
|
||||
md += ['',
|
||||
'## D. Max-cos pair vs min-dh pair coincidence',
|
||||
'']
|
||||
pc = results.get('pair_coincidence', {})
|
||||
if 'same_pair_rate' in pc and pc['same_pair_rate'] is not None:
|
||||
md += [f'- n_signatures with both pair IDs: '
|
||||
f'{pc["n_with_both_pair_ids"]:,}',
|
||||
f'- same-pair rate: {pc["same_pair_rate"]:.4f} '
|
||||
f'({pc["n_same_pair"]:,} of '
|
||||
f'{pc["n_with_both_pair_ids"]:,})',
|
||||
'',
|
||||
('A high rate (>0.8) supports a single-pair regime '
|
||||
'descriptor language (cos and dh attached to the same '
|
||||
'partner). A low rate indicates the two descriptors '
|
||||
'attach to different partners and should be discussed '
|
||||
'as parallel-but-different evidence.')]
|
||||
elif 'error' in pc:
|
||||
md += [f'- column not present in DB: {pc["error"]}',
|
||||
('- note: schema-dependent; pair IDs not currently stored '
|
||||
'in signatures table.')]
|
||||
md.append('')
|
||||
md_path = OUT / 'dhash_discrete_report.md'
|
||||
md_path.write_text('\n'.join(md), encoding='utf-8')
|
||||
print(f'[md ] {md_path}')
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
@@ -0,0 +1,250 @@
#!/usr/bin/env python3
"""
Script 39e: dHash Firm-Residualized + Jittered Dip (final test)
================================================================
Script 39d showed:
- Within-firm dh dip rejections all vanish after jitter (integer
  artifact)
- Big-4 pooled dh dip survives jitter (p_median=0 over 5 seeds)

But Firm A mean dh = 2.73 vs Firms B/C/D ~6.5-7.4 -- a large
between-firm location shift, analogous to the cosine case where
firm-mean centering eliminated rejection.

This script applies BOTH corrections simultaneously:
1. Firm-mean centering (remove between-firm location shifts)
2. Uniform jitter in [-0.5, +0.5] (remove integer ties)

If the doubly-corrected dh distribution rejects unimodality, the
Big-4 pooled multimodality is a genuine within-population, continuous
phenomenon. If it fails to reject, dh "multimodality" is fully
explained by between-firm composition (same conclusion as cosine).

Multi-seed (5 seeds) for robustness.

Outputs:
    reports/v4_big4/dhash_discrete_robustness/
        dhash_residualized_jittered_results.json
        dhash_residualized_jittered_report.md
"""

import json
import sqlite3
import numpy as np
import diptest
from pathlib import Path
from datetime import datetime

DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
           'v4_big4/dhash_discrete_robustness')
OUT.mkdir(parents=True, exist_ok=True)

BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
ALIAS = {'勤業眾信聯合': 'Firm A',
         '安侯建業聯合': 'Firm B',
         '資誠聯合': 'Firm C',
         '安永聯合': 'Firm D'}
N_BOOT = 2000
SEEDS = [42, 43, 44, 45, 46]


def load_signatures():
    conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
    cur = conn.cursor()
    cur.execute('''
        SELECT a.firm, CAST(s.min_dhash_independent AS REAL)
        FROM signatures s
        JOIN accountants a ON s.assigned_accountant = a.name
        WHERE s.assigned_accountant IS NOT NULL
          AND s.max_similarity_to_same_accountant IS NOT NULL
          AND s.min_dhash_independent IS NOT NULL
          AND a.firm IS NOT NULL
    ''')
    rows = cur.fetchall()
    conn.close()
    return rows


def firm_residualize(values, firm_labels):
    arr = np.asarray(values, dtype=float)
    firms = np.asarray(firm_labels)
    out = arr.copy()
    grand = float(np.mean(arr))
    for f in np.unique(firms):
        m = firms == f
        out[m] = arr[m] - float(np.mean(arr[m])) + grand
    return out


def dip_multi(values, seeds, with_jitter, n_boot=N_BOOT):
    arr = np.asarray(values, dtype=float)
    arr = arr[np.isfinite(arr)]
    results = []
    for seed in seeds:
        rng = np.random.default_rng(seed)
        v = arr + rng.uniform(-0.5, 0.5, len(arr)) if with_jitter else arr
        d, p = diptest.diptest(v, boot_pval=True, n_boot=n_boot)
        results.append({'seed': seed, 'dip': float(d), 'p': float(p)})
        if not with_jitter:
            break  # without jitter the seed is irrelevant
    return results


def _fmt_p(p):
    return '< 5e-4' if p == 0.0 else f'{p:.4g}'


def summarize(name, results):
    ps = [r['p'] for r in results]
    ds = [r['dip'] for r in results]
    return {
        'name': name,
        'n_seeds': len(results),
        'dip_min': min(ds), 'dip_max': max(ds), 'dip_median': float(np.median(ds)),
        'p_min': min(ps), 'p_max': max(ps), 'p_median': float(np.median(ps)),
        'reject_at_05_count': int(sum(1 for p in ps if p <= 0.05)),
        'per_seed': results,
    }


def main():
    print('=' * 72)
    print('Script 39e: dHash Firm-Residualized + Jittered Dip')
    print('=' * 72)
    rows = load_signatures()
    firms_raw = np.array([r[0] for r in rows])
    dh = np.array([r[1] for r in rows], dtype=float)
    is_big4 = np.isin(firms_raw, BIG4)
    big4_dh = dh[is_big4]
    big4_firms = np.array([ALIAS[f] for f in firms_raw[is_big4]])

    print(f'\nLoaded {len(rows):,} signatures; Big-4 {is_big4.sum():,}')
    print('\nPer-firm Big-4 dh summary:')
    for f in sorted(set(big4_firms)):
        v = big4_dh[big4_firms == f]
        print(f' {f}: n={len(v):,} mean={v.mean():.3f} '
              f'median={np.median(v):.1f} sd={v.std():.3f}')

    # ---- Test conditions, all on Big-4 signature-level dh ----
    panels = {}

    # 1. Raw (no centering, no jitter)
    print('\n[1] Raw dh')
    r = dip_multi(big4_dh, [42], with_jitter=False)
    panels['raw'] = summarize('raw', r)
    print(f' dip={r[0]["dip"]:.5f}, p={_fmt_p(r[0]["p"])}')

    # 2. Centered only (no jitter; integer values preserved)
    print('\n[2] Firm-mean centered, no jitter')
    centered = firm_residualize(big4_dh, big4_firms)
    r = dip_multi(centered, [42], with_jitter=False)
    panels['centered_only'] = summarize('centered_only', r)
    print(f' dip={r[0]["dip"]:.5f}, p={_fmt_p(r[0]["p"])}')

    # 3. Jittered only (no centering)
    print('\n[3] Jittered (5 seeds), no centering')
    r = dip_multi(big4_dh, SEEDS, with_jitter=True)
    panels['jitter_only'] = summarize('jitter_only', r)
    print(f' p_median={panels["jitter_only"]["p_median"]:.4g}, '
          f'reject@.05 in '
          f'{panels["jitter_only"]["reject_at_05_count"]}/5 seeds')

    # 4. Centered + jittered (THE key test)
    print('\n[4] Firm-mean centered + jittered (5 seeds) -- KEY TEST')
    r = dip_multi(centered, SEEDS, with_jitter=True)
    panels['centered_jittered'] = summarize('centered_jittered', r)
    print(f' p_median={panels["centered_jittered"]["p_median"]:.4g}, '
          f'reject@.05 in '
          f'{panels["centered_jittered"]["reject_at_05_count"]}/5 seeds')
    for s in r:
        print(f' seed {s["seed"]}: dip={s["dip"]:.5f}, p={_fmt_p(s["p"])}')

    # Per-firm dh stats (re-confirm Firm A shift)
    firm_stats = {}
    for f in sorted(set(big4_firms)):
        v = big4_dh[big4_firms == f]
        firm_stats[f] = {
            'n': int(len(v)),
            'mean': float(v.mean()),
            'median': float(np.median(v)),
            'sd': float(v.std()),
            'p25': float(np.percentile(v, 25)),
            'p75': float(np.percentile(v, 75)),
            'pct_le_5': float(np.mean(v <= 5)),
            'pct_gt_15': float(np.mean(v > 15)),
        }

    results = {
        'meta': {
            'script': '39e',
            'timestamp': datetime.now().isoformat(timespec='seconds'),
            'n_big4_signatures': int(big4_dh.size),
            'n_boot': N_BOOT,
            'seeds': SEEDS,
            'note': ('Final test: does Big-4 pooled dh multimodality '
                     'survive BOTH firm-mean centering and integer-tie '
                     'jitter?'),
        },
        'panels': panels,
        'per_firm_dh_stats': firm_stats,
    }

    json_path = OUT / 'dhash_residualized_jittered_results.json'
    json_path.write_text(json.dumps(results, indent=2, ensure_ascii=False),
                         encoding='utf-8')
    print(f'\n[json] {json_path}')

    md = [
        '# dHash Firm-Residualized + Jittered Dip (Script 39e)',
        '', f'Generated: {results["meta"]["timestamp"]}',
        f'Bootstrap replicates: {N_BOOT}; jitter seeds: {SEEDS}',
        '',
        '## Per-firm Big-4 dh summary',
        '', '| Firm | n | mean | median | sd | P25 | P75 | %<=5 | %>15 |',
        '|---|---|---|---|---|---|---|---|---|',
    ]
    for f, s in firm_stats.items():
        md.append(f'| {f} | {s["n"]:,} | {s["mean"]:.3f} | '
                  f'{s["median"]:.1f} | {s["sd"]:.3f} | '
                  f'{s["p25"]:.1f} | {s["p75"]:.1f} | '
                  f'{s["pct_le_5"]:.3f} | {s["pct_gt_15"]:.3f} |')
    md += [
        '',
        '## Dip test under four conditions (Big-4 pooled, sig-level)',
        '',
        '| Condition | dip | p (or p_median) | reject@.05 (seeds) |',
        '|---|---|---|---|',
        f'| 1. Raw (integer values) | {panels["raw"]["dip_median"]:.5f} '
        f'| {_fmt_p(panels["raw"]["p_median"])} | n/a (1 seed) |',
        f'| 2. Firm-mean centered, no jitter '
        f'| {panels["centered_only"]["dip_median"]:.5f} '
        f'| {_fmt_p(panels["centered_only"]["p_median"])} | n/a (1 seed) |',
        f'| 3. Jittered only (5 seeds) '
        f'| median {panels["jitter_only"]["dip_median"]:.5f} '
        f'| median {_fmt_p(panels["jitter_only"]["p_median"])} '
        f'| {panels["jitter_only"]["reject_at_05_count"]}/5 |',
        f'| 4. **Centered + jittered (5 seeds)** '
        f'| median {panels["centered_jittered"]["dip_median"]:.5f} '
        f'| median {_fmt_p(panels["centered_jittered"]["p_median"])} '
        f'| {panels["centered_jittered"]["reject_at_05_count"]}/5 |',
        '',
        '## Interpretation',
        '',
        ('If Condition 4 still rejects unimodality, Big-4 dh has '
         'genuine within-population continuous multimodality '
         'independent of both between-firm location shifts and '
         'integer mass points. If Condition 4 fails to reject, the '
         'Big-4 pooled dh multimodality is fully explained by '
         '(between-firm mean shift) + (integer mass points). In the '
         'latter case, the dh axis carries no independent within-firm '
         'regime evidence beyond the cos axis.'),
        '',
    ]
    md_path = OUT / 'dhash_residualized_jittered_report.md'
    md_path.write_text('\n'.join(md), encoding='utf-8')
    print(f'[md ] {md_path}')


if __name__ == '__main__':
    main()
@@ -0,0 +1,413 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Script 40b: Inter-CPA FAR Sweep for cos and dHash (joint + marginal)
|
||||
=====================================================================
|
||||
After codex round-29 destroyed the distributional path to thresholds
|
||||
(K=3 mixture / dip / antimode shown composition-driven by Scripts
|
||||
39b–39e), v4.0 pivots to an anchor-based threshold framework:
|
||||
empirically derived from inter-CPA negative anchor specificity.
|
||||
|
||||
Inter-CPA pairs (different CPAs, all-firm) are the negative anchor:
|
||||
they are by definition not same-CPA replications, and the user's
|
||||
within-CPA mechanism-transition concern (a CPA might switch from
|
||||
hand-sign to template mid-career) does not enter the inter-CPA
|
||||
calibration because each sampled pair crosses CPA boundaries.
|
||||
|
||||
This script samples a large number of inter-CPA pairs and computes
|
||||
both descriptors per pair (cosine via feature_vector dot product;
|
||||
Hamming distance via dhash_vector XOR). It then sweeps:
|
||||
|
||||
1. FAR(cos > k) across k in [0.80, 0.99]
|
||||
2. FAR(dHash <= k) across k in [0, 20]
|
||||
3. Joint FAR(cos > 0.95 AND dHash <= k) for k in [0, 20]
|
||||
4. Conditional FAR(dHash <= k | cos > 0.95) -- the v3 inherited
|
||||
rule's marginal specificity contribution from dHash
|
||||
|
||||
Outputs:
|
||||
reports/v4_big4/inter_cpa_far_sweep/
|
||||
far_sweep_results.json
|
||||
far_sweep_report.md
|
||||
|
||||
Sample size: 500,000 inter-CPA pairs (matches v3 Script 10
|
||||
convention). Big-4-only and full-corpus variants both reported.
|
||||
"""
|
||||
|
||||
import json
|
||||
import sqlite3
|
||||
import numpy as np
|
||||
from pathlib import Path
|
||||
from datetime import datetime
|
||||
from collections import defaultdict
|
||||
|
||||
DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
|
||||
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
|
||||
'v4_big4/inter_cpa_far_sweep')
|
||||
OUT.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
|
||||
ALIAS = {'勤業眾信聯合': 'Firm A',
|
||||
'安侯建業聯合': 'Firm B',
|
||||
'資誠聯合': 'Firm C',
|
||||
'安永聯合': 'Firm D'}
|
||||
N_PAIRS = 500_000
|
||||
SEED = 42
|
||||
|
||||
COS_GRID = [0.80, 0.83, 0.85, 0.87, 0.89, 0.90, 0.91, 0.92, 0.93, 0.94,
|
||||
0.945, 0.95, 0.955, 0.96, 0.965, 0.97, 0.975, 0.98, 0.985,
|
||||
0.99]
|
||||
DH_GRID = list(range(0, 21))
|
||||
|
||||
|
||||
def hamming_64bit(a_bytes, b_bytes):
|
||||
"""Hamming distance between two 8-byte (64-bit) dHash byte strings."""
|
||||
a = int.from_bytes(a_bytes, 'big')
|
||||
b = int.from_bytes(b_bytes, 'big')
|
||||
return (a ^ b).bit_count()
|
||||
|
||||
|
||||
def load_signatures():
|
||||
conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
|
||||
cur = conn.cursor()
|
||||
cur.execute('''
|
||||
SELECT s.signature_id, s.assigned_accountant, a.firm,
|
||||
s.feature_vector, s.dhash_vector
|
||||
FROM signatures s
|
||||
JOIN accountants a ON s.assigned_accountant = a.name
|
||||
WHERE s.assigned_accountant IS NOT NULL
|
||||
AND s.feature_vector IS NOT NULL
|
||||
AND s.dhash_vector IS NOT NULL
|
||||
AND a.firm IS NOT NULL
|
||||
''')
|
||||
rows = cur.fetchall()
|
||||
conn.close()
|
||||
return rows
|
||||
|
||||
|
||||
def sample_inter_cpa_pairs(rows, n_pairs, seed, restrict_to_big4=False):
|
||||
"""Sample inter-CPA pairs and compute (cos, dh) for each."""
|
||||
rng = np.random.default_rng(seed)
|
||||
if restrict_to_big4:
|
||||
rows = [r for r in rows if r[2] in BIG4]
|
||||
scope = 'big4_only'
|
||||
else:
|
||||
scope = 'all_firms'
|
||||
print(f' [{scope}] {len(rows):,} signatures available')
|
||||
|
||||
by_acct = defaultdict(list)
|
||||
for r in rows:
|
||||
by_acct[r[1]].append(r)
|
||||
accountants = list(by_acct.keys())
|
||||
n_acct = len(accountants)
|
||||
print(f' [{scope}] {n_acct} accountants')
|
||||
|
||||
features = {a: np.stack(
|
||||
[np.frombuffer(r[3], dtype=np.float32) for r in by_acct[a]]
|
||||
) for a in accountants}
|
||||
dhashes = {a: [r[4] for r in by_acct[a]] for a in accountants}
|
||||
|
||||
cos_vals = np.empty(n_pairs, dtype=np.float32)
|
||||
dh_vals = np.empty(n_pairs, dtype=np.int32)
|
||||
n_done = 0
|
||||
for _ in range(n_pairs):
|
||||
i, j = rng.choice(n_acct, 2, replace=False)
|
||||
a1, a2 = accountants[i], accountants[j]
|
||||
n1, n2 = len(by_acct[a1]), len(by_acct[a2])
|
||||
k1 = int(rng.integers(0, n1))
|
||||
k2 = int(rng.integers(0, n2))
|
||||
f1 = features[a1][k1]
|
||||
f2 = features[a2][k2]
|
||||
cos = float(f1 @ f2)
|
||||
d = hamming_64bit(dhashes[a1][k1], dhashes[a2][k2])
|
||||
cos_vals[n_done] = cos
|
||||
dh_vals[n_done] = d
|
||||
n_done += 1
|
||||
return scope, cos_vals, dh_vals
|
||||
|
||||
|
||||
def wilson_ci(k, n, z=1.96):
|
||||
if n == 0:
|
||||
return (None, None)
|
||||
phat = k / n
|
||||
denom = 1 + z * z / n
|
||||
centre = (phat + z * z / (2 * n)) / denom
|
||||
half = z * np.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
|
||||
return (max(0.0, centre - half), min(1.0, centre + half))
|
||||
|
||||
|
||||
def far_at_cos(cos_vals, k):
|
||||
n = len(cos_vals)
|
||||
hits = int((cos_vals > k).sum())
|
||||
lo, hi = wilson_ci(hits, n)
|
||||
return {'k': float(k), 'n': n, 'hits': hits,
|
||||
'far': hits / n, 'ci95_lo': lo, 'ci95_hi': hi}
|
||||
|
||||
|
||||
def far_at_dh_le(dh_vals, k):
|
||||
n = len(dh_vals)
|
||||
hits = int((dh_vals <= k).sum())
|
||||
lo, hi = wilson_ci(hits, n)
|
||||
return {'k': int(k), 'n': n, 'hits': hits,
|
||||
'far': hits / n, 'ci95_lo': lo, 'ci95_hi': hi}
|
||||
|
||||
|
||||
def joint_far(cos_vals, dh_vals, cos_k, dh_k):
|
||||
n = len(cos_vals)
|
||||
hits = int(((cos_vals > cos_k) & (dh_vals <= dh_k)).sum())
|
||||
lo, hi = wilson_ci(hits, n)
|
||||
return {'cos_k': float(cos_k), 'dh_k': int(dh_k),
|
||||
'n': n, 'hits': hits,
|
||||
'far': hits / n, 'ci95_lo': lo, 'ci95_hi': hi}
|
||||
|
||||
|
||||
def cond_far(cos_vals, dh_vals, cos_k, dh_k):
|
||||
"""FAR(dh<=k | cos>cos_k)"""
|
||||
cos_mask = cos_vals > cos_k
|
||||
n_cond = int(cos_mask.sum())
|
||||
if n_cond == 0:
|
||||
return {'cos_k': float(cos_k), 'dh_k': int(dh_k),
|
||||
'n_cond': 0, 'hits': 0,
|
||||
'cond_far': None, 'ci95_lo': None, 'ci95_hi': None}
|
||||
hits = int(((dh_vals <= dh_k) & cos_mask).sum())
|
||||
lo, hi = wilson_ci(hits, n_cond)
|
||||
return {'cos_k': float(cos_k), 'dh_k': int(dh_k),
|
||||
'n_cond': n_cond, 'hits': hits,
|
||||
'cond_far': hits / n_cond, 'ci95_lo': lo, 'ci95_hi': hi}
|
||||
|
||||
|
||||
def invert_far_target(curve_entries, target, key='far'):
|
||||
"""Return the entries bracketing the target FAR (linear scan)."""
|
||||
sorted_e = sorted(curve_entries, key=lambda e: e[key])
|
||||
for e in sorted_e:
|
||||
if e[key] <= target:
|
||||
best = e
|
||||
else:
|
||||
break
|
||||
return best if sorted_e and sorted_e[0][key] <= target else None
|
||||
|
||||
|
||||
def _fmt(x, fmt='.5f'):
|
||||
return 'None' if x is None else format(x, fmt)
|
||||
|
||||
|
||||
def run_scope(rows, scope_name, restrict_to_big4):
|
||||
print(f'\n== Scope: {scope_name} ==')
|
||||
scope_label, cos_vals, dh_vals = sample_inter_cpa_pairs(
|
||||
rows, N_PAIRS, SEED, restrict_to_big4=restrict_to_big4)
|
||||
print(f' Sampled {len(cos_vals):,} inter-CPA pairs')
|
||||
print(f' cos: mean={cos_vals.mean():.4f}, '
|
||||
f'median={np.median(cos_vals):.4f}, '
|
||||
f'std={cos_vals.std():.4f}')
|
||||
print(f' dh : mean={dh_vals.mean():.4f}, '
|
||||
f'median={np.median(dh_vals):.4f}, '
|
||||
f'std={dh_vals.std():.4f}')
|
||||
|
||||
cos_curve = [far_at_cos(cos_vals, k) for k in COS_GRID]
|
||||
dh_curve = [far_at_dh_le(dh_vals, k) for k in DH_GRID]
|
||||
joint_curve_95 = [joint_far(cos_vals, dh_vals, 0.95, k) for k in DH_GRID]
|
||||
cond_curve_95 = [cond_far(cos_vals, dh_vals, 0.95, k) for k in DH_GRID]
|
||||
|
||||
print('\n [Cos FAR sweep]')
|
||||
for e in cos_curve:
|
||||
print(f' cos > {e["k"]:.3f}: FAR={_fmt(e["far"])}, '
|
||||
f'CI=[{_fmt(e["ci95_lo"])}, {_fmt(e["ci95_hi"])}], '
|
||||
f'hits={e["hits"]}/{e["n"]}')
|
||||
|
||||
print('\n [dHash FAR sweep]')
|
||||
for e in dh_curve:
|
||||
print(f' dh <= {e["k"]:2d}: FAR={_fmt(e["far"])}, '
|
||||
f'CI=[{_fmt(e["ci95_lo"])}, {_fmt(e["ci95_hi"])}], '
|
||||
f'hits={e["hits"]}/{e["n"]}')
|
||||
|
||||
print('\n [Joint FAR (cos > 0.95 AND dh <= k)]')
|
||||
for e in joint_curve_95:
|
||||
print(f' dh <= {e["dh_k"]:2d}: FAR={_fmt(e["far"])}, '
|
||||
f'hits={e["hits"]}/{e["n"]}')
|
||||
|
||||
print('\n [Conditional FAR(dh <= k | cos > 0.95)]')
|
||||
for e in cond_curve_95:
|
||||
cf = e['cond_far']
|
||||
print(f' dh <= {e["dh_k"]:2d}: P(dh<=k | cos>0.95)='
|
||||
f'{_fmt(cf) if cf is not None else "n/a"}, '
|
||||
f'hits={e["hits"]}/{e["n_cond"]}')
|
||||
|
||||
targets = [0.005, 0.001, 0.0005, 0.0001]
|
||||
inv = {}
|
||||
for t in targets:
|
||||
inv[f'cos_far_<=_{t}'] = invert_far_target(cos_curve, t, 'far')
|
||||
inv[f'dh_far_<=_{t}'] = invert_far_target(dh_curve, t, 'far')
|
||||
inv[f'joint_at_cos95_far_<=_{t}'] = invert_far_target(
|
||||
joint_curve_95, t, 'far')
|
||||
|
||||
print('\n [Threshold inversion]')
|
||||
for tgt in targets:
|
||||
e = inv[f'cos_far_<=_{tgt}']
|
||||
if e is not None:
|
||||
print(f' FAR <= {tgt}: max cos threshold with FAR<=tgt is '
|
||||
f'cos > {e["k"]:.3f} (FAR={e["far"]:.5f})')
|
||||
e = inv[f'dh_far_<=_{tgt}']
|
||||
if e is not None:
|
||||
print(f' FAR <= {tgt}: max dh threshold with FAR<=tgt is '
|
||||
f'dh <= {e["k"]} (FAR={e["far"]:.5f})')
|
||||
e = inv[f'joint_at_cos95_far_<=_{tgt}']
|
||||
if e is not None:
|
||||
print(f' FAR <= {tgt}: under cos>0.95, max dh threshold '
|
||||
f'with joint FAR<=tgt is dh <= {e["dh_k"]} '
|
||||
f'(joint FAR={e["far"]:.5f})')
|
||||
|
||||
return {
|
||||
'scope': scope_label,
|
||||
'n_pairs': int(len(cos_vals)),
|
||||
'cos_summary': {
|
||||
'mean': float(cos_vals.mean()),
|
||||
'median': float(np.median(cos_vals)),
|
||||
'std': float(cos_vals.std()),
|
||||
'p99': float(np.percentile(cos_vals, 99)),
|
||||
'p999': float(np.percentile(cos_vals, 99.9)),
|
||||
'max': float(cos_vals.max()),
|
||||
},
|
||||
'dh_summary': {
|
||||
'mean': float(dh_vals.mean()),
|
||||
'median': float(np.median(dh_vals)),
|
||||
'std': float(dh_vals.std()),
|
||||
'p01': float(np.percentile(dh_vals, 1)),
|
||||
'p001': float(np.percentile(dh_vals, 0.1)),
|
||||
'min': int(dh_vals.min()),
|
||||
},
|
||||
'cos_far_curve': cos_curve,
|
||||
'dh_far_curve': dh_curve,
|
||||
'joint_far_at_cos95_curve': joint_curve_95,
|
||||
'cond_far_at_cos95_curve': cond_curve_95,
|
||||
'threshold_inversions': inv,
|
||||
}
|
||||
|
||||
|
||||
def main():
|
||||
print('=' * 72)
|
||||
print('Script 40b: Inter-CPA FAR Sweep (cos + dHash, joint + marginal)')
|
||||
print('=' * 72)
|
||||
rows = load_signatures()
|
||||
print(f'\nLoaded {len(rows):,} signatures (full corpus)')
|
||||
|
||||
results = {
|
||||
'meta': {
|
||||
'script': '40b',
|
||||
'timestamp': datetime.now().isoformat(timespec='seconds'),
|
||||
'n_pairs_sampled': N_PAIRS,
|
||||
'seed': SEED,
|
||||
'note': ('Inter-CPA pair-level FAR sweep for cos and dHash. '
|
||||
'Anchor-based threshold derivation; replaces '
|
||||
'distributional path attacked in codex round-29.'),
|
||||
},
|
||||
'scopes': {},
|
||||
}
|
||||
|
||||
results['scopes']['big4_only'] = run_scope(
|
||||
rows, 'Big-4 only', restrict_to_big4=True)
|
||||
results['scopes']['all_firms'] = run_scope(
|
||||
rows, 'All firms', restrict_to_big4=False)
|
||||
|
||||
json_path = OUT / 'far_sweep_results.json'
|
||||
json_path.write_text(json.dumps(results, indent=2, ensure_ascii=False),
|
||||
encoding='utf-8')
|
||||
print(f'\n[json] {json_path}')
|
||||
|
||||
md = [
|
||||
'# Inter-CPA FAR Sweep (Script 40b)',
|
||||
'',
|
||||
f'Generated: {results["meta"]["timestamp"]}',
|
||||
f'Inter-CPA pair samples per scope: {N_PAIRS:,}; seed: {SEED}',
|
||||
'',
|
||||
('Anchor-based threshold derivation. For each scope (Big-4 only '
|
||||
'or all firms), sample random inter-CPA pairs and compute '
|
||||
'cosine + Hamming distance per pair. Report False Acceptance '
|
||||
'Rates (FAR) at various thresholds; invert FAR target to '
|
||||
'derive thresholds with empirical specificity guarantees.'),
|
||||
'',
|
||||
]
|
||||
|
||||
for scope in ['big4_only', 'all_firms']:
|
||||
s = results['scopes'][scope]
|
||||
md += [f'## Scope: {scope} ({s["n_pairs"]:,} pairs)', '',
|
||||
'### Cosine FAR curve', '',
|
||||
'| cos > k | FAR | 95% CI | hits / n |',
|
||||
'|---|---|---|---|']
|
||||
for e in s['cos_far_curve']:
|
||||
md.append(f'| {e["k"]:.3f} | {_fmt(e["far"])} | '
|
||||
                  f'[{_fmt(e["ci95_lo"])}, {_fmt(e["ci95_hi"])}] | '
                  f'{e["hits"]:,} / {e["n"]:,} |')
    md += ['', '### dHash FAR curve', '',
           '| dh <= k | FAR | 95% CI | hits / n |',
           '|---|---|---|---|']
    for e in s['dh_far_curve']:
        md.append(f'| {e["k"]:2d} | {_fmt(e["far"])} | '
                  f'[{_fmt(e["ci95_lo"])}, {_fmt(e["ci95_hi"])}] | '
                  f'{e["hits"]:,} / {e["n"]:,} |')
    md += ['', '### Joint FAR (cos > 0.95 AND dh <= k)', '',
           '| dh <= k | Joint FAR | hits / n |',
           '|---|---|---|']
    for e in s['joint_far_at_cos95_curve']:
        md.append(f'| {e["dh_k"]:2d} | {_fmt(e["far"])} | '
                  f'{e["hits"]:,} / {e["n"]:,} |')
    md += ['',
           '### Conditional FAR(dh <= k | cos > 0.95)',
           '',
           'Among inter-CPA pairs that already exceed cos > 0.95, '
           'what fraction also have dh <= k? This quantifies '
           "dHash's marginal specificity contribution given the cos "
           "gate is already applied.",
           '',
           '| dh <= k | Conditional FAR | hits / n_cond |',
           '|---|---|---|']
    for e in s['cond_far_at_cos95_curve']:
        cf = e['cond_far']
        md.append(f'| {e["dh_k"]:2d} | '
                  f'{_fmt(cf) if cf is not None else "n/a"} | '
                  f'{e["hits"]:,} / {e["n_cond"]:,} |')
    md += ['', '### Threshold inversion', '',
           '| FAR target | cos thresh | dh thresh | joint dh thresh '
           '(under cos>0.95) |',
           '|---|---|---|---|']
    for tgt in [0.005, 0.001, 0.0005, 0.0001]:
        e_c = s['threshold_inversions'].get(f'cos_far_<=_{tgt}')
        e_d = s['threshold_inversions'].get(f'dh_far_<=_{tgt}')
        e_j = s['threshold_inversions'].get(
            f'joint_at_cos95_far_<=_{tgt}')
        c_str = (f'cos > {e_c["k"]:.3f} (FAR={e_c["far"]:.5f})'
                 if e_c else 'unachievable')
        d_str = (f'dh <= {e_d["k"]} (FAR={e_d["far"]:.5f})'
                 if e_d else 'unachievable')
        j_str = (f'dh <= {e_j["dh_k"]} (FAR={e_j["far"]:.5f})'
                 if e_j else 'unachievable')
        md.append(f'| {tgt} | {c_str} | {d_str} | {j_str} |')
    md.append('')

    md += [
        '## Interpretation',
        '',
        ('- The cosine FAR curve replicates and extends v3.x §IV-I '
         'Table X (which reported FAR=0.0005 at cos>0.95 from a '
         'similar but smaller-sample inter-CPA negative anchor).'),
        ('- The dHash FAR curve is the v4 contribution: prior v3.x '
         'work used dh<=5 by convention without an empirical '
         'specificity derivation. This script derives a specificity '
         "target → dh threshold mapping."),
        ('- The conditional FAR(dh<=k | cos>0.95) curve tells us '
         'whether dHash adds specificity given the cos gate. If the '
         "conditional FAR at dh<=5 is meaningfully lower than 1.0, "
         'dHash is providing additional specificity. If it is near '
         '1.0, dHash is largely redundant given cos>0.95 and the '
         'five-way rule should be simplified.'),
        ('- Thresholds derived by inverting FAR targets are '
         'specificity-anchored operating points, not distributional '
         'antimodes. They are robust to the integer-mass-point and '
         'between-firm-composition artefacts identified in Scripts '
         '39b–39e.'),
        '',
    ]
    md_path = OUT / 'far_sweep_report.md'
    md_path.write_text('\n'.join(md), encoding='utf-8')
    print(f'[md ] {md_path}')


if __name__ == '__main__':
main()
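The threshold-inversion step above walks each FAR curve in order and keeps the first operating point whose empirical FAR is at or below the target (or reports it as unachievable). A minimal standalone sketch of that rule, with an illustrative toy curve rather than data from the paper:

```python
# Sketch of the threshold-inversion rule used by Script 42: scan an
# ordered FAR curve and return the first operating point whose
# empirical FAR meets the target. The curve below is a made-up toy.

def invert_far(curve, target):
    """First operating point with far <= target, else None."""
    for point in curve:
        if point['far'] <= target:
            return point
    return None


toy_cos_curve = [
    {'k': 0.90, 'far': 0.0100},
    {'k': 0.95, 'far': 0.0005},
    {'k': 0.98, 'far': 0.0001},
]

print(invert_far(toy_cos_curve, 0.001))    # picks the k=0.95 point
print(invert_far(toy_cos_curve, 0.00005))  # unachievable -> None
```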
@@ -0,0 +1,568 @@
#!/usr/bin/env python3
"""
Script 43: Pool-Normalized Per-Signature FAR (anchor-based calibration)
========================================================================
Codex round-30 verdict on Script 40b: per-pair FAR (~0.00060 at
cos>0.95) is NOT the per-signature classifier specificity. The
deployed classifier uses max-cosine and min-dHash over each CPA's
same-CPA pool, so the inter-CPA-equivalent specificity for a
signature with pool size n is approximately 1 - (1 - pair_FAR)^n,
which for Big-4 median pool ~280 is several percent, not 0.00014.

This script computes pool-normalized per-signature FAR by drawing,
for each source signature s, a random inter-CPA candidate pool of
size n_pool(s) (= same-CPA pool size of s), and computing the
deployed descriptors against the random pool. The fraction of
source signatures whose max-cosine exceeds k (and/or min-dHash <= k)
is the per-signature FAR at that operating point.

We also report:
- "Any-pair" joint FAR: max_cos > c AND min_dh <= d (descriptors
  may come from different candidates)
- "Same-pair" joint FAR: at least one candidate has both
  cos > c AND dh <= d
- Per-firm and pool-size-decile stratification
- CPA-block bootstrap CI on key FAR points
- Threshold inversion for target per-signature FAR

Inputs: full Big-4 sub-corpus (n=150,453 sigs / 468 CPAs).
Random pool draws use one realisation per source signature, with
seed control. CPA-block bootstrap quantifies sampling noise.

Outputs:
  reports/v4_big4/pool_normalized_far/
    pool_normalized_results.json
    pool_normalized_report.md
"""

import json
import sqlite3
import numpy as np
from pathlib import Path
from datetime import datetime
from collections import defaultdict

DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
           'v4_big4/pool_normalized_far')
OUT.mkdir(parents=True, exist_ok=True)

BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
ALIAS = {'勤業眾信聯合': 'Firm A',
         '安侯建業聯合': 'Firm B',
         '資誠聯合': 'Firm C',
         '安永聯合': 'Firm D'}
SEED = 42
BATCH = 200        # source signatures per batch
N_BOOT_CPA = 1000  # CPA-block bootstrap replicates

COS_KS = [0.90, 0.92, 0.93, 0.94, 0.945, 0.95, 0.955, 0.96, 0.97, 0.98]
DH_KS = [2, 3, 4, 5, 6, 8, 10, 15]


def load_big4():
    conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
    cur = conn.cursor()
    cur.execute('''
        SELECT s.signature_id, s.assigned_accountant, a.firm, s.source_pdf,
               s.feature_vector, s.dhash_vector
        FROM signatures s
        JOIN accountants a ON s.assigned_accountant = a.name
        WHERE s.assigned_accountant IS NOT NULL
          AND s.feature_vector IS NOT NULL
          AND s.dhash_vector IS NOT NULL
          AND a.firm IN (?, ?, ?, ?)
    ''', BIG4)
    rows = cur.fetchall()
    conn.close()
    return rows


def hamming_vec(query_bytes, cand_bytes_array):
    """Hamming between one 8-byte hash and an array of 8-byte hashes."""
    q = int.from_bytes(query_bytes, 'big')
    out = np.empty(len(cand_bytes_array), dtype=np.int32)
    for i, c in enumerate(cand_bytes_array):
        c_int = int.from_bytes(c, 'big')
        out[i] = (q ^ c_int).bit_count()
    return out


def wilson_ci(k, n, z=1.96):
    if n == 0:
        return (None, None)
    phat = k / n
    denom = 1 + z * z / n
    centre = (phat + z * z / (2 * n)) / denom
    half = z * np.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def main():
    print('=' * 72)
    print('Script 43: Pool-Normalized Per-Signature FAR')
    print('=' * 72)
    rows = load_big4()
    n_sigs = len(rows)
    print(f'\nLoaded {n_sigs:,} Big-4 signatures')

    # Build index arrays
    sig_ids = np.array([r[0] for r in rows], dtype=np.int64)
    cpas = np.array([r[1] for r in rows])
    firms = np.array([ALIAS[r[2]] for r in rows])
    source_pdfs = np.array([r[3] for r in rows])

    # Feature matrix
    feats = np.stack([np.frombuffer(r[4], dtype=np.float32)
                      for r in rows]).astype(np.float32)
    print(f' Feature matrix: {feats.shape}, '
          f'{feats.nbytes / 1e9:.2f} GB')
    # L2-normalize
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    feats = feats / norms

    # dHash bytes
    dhashes = [r[5] for r in rows]

    # CPA → indices
    cpa_to_idx = defaultdict(list)
    for i, c in enumerate(cpas):
        cpa_to_idx[c].append(i)
    cpa_to_idx = {c: np.array(v, dtype=np.int64)
                  for c, v in cpa_to_idx.items()}
    n_cpas = len(cpa_to_idx)
    pool_sizes = {c: len(v) - 1 for c, v in cpa_to_idx.items()}
    print(f' CPAs: {n_cpas}; pool-size summary: '
          f'min={min(pool_sizes.values())}, '
          f'median={int(np.median(list(pool_sizes.values())))}, '
          f'max={max(pool_sizes.values())}')

    # Pre-compute: for sampling non-same-CPA candidates, we need fast
    # index sampling. The total available pool for each source sig is
    # all_indices \ same_cpa_indices.
    all_idx = np.arange(n_sigs, dtype=np.int64)

    # ── Per-source-signature simulation ─────────────────────
    print('\nSimulating per-source-signature inter-CPA-equivalent pool...')
    rng = np.random.default_rng(SEED)

    # Per-signature stored statistics
    max_cos = np.zeros(n_sigs, dtype=np.float32)
    min_dh = np.zeros(n_sigs, dtype=np.int32)
    cos_at_min_dh = np.zeros(n_sigs, dtype=np.float32)
    dh_at_max_cos = np.zeros(n_sigs, dtype=np.int32)
    pool_size_arr = np.zeros(n_sigs, dtype=np.int32)

    # For each source signature, we also record indicator for same-pair
    # joint event at (cos>0.95, dh<=5) -- the headline operational rule.
    # This requires keeping per-signature any() flag for that pair.
    headline_same_pair_95_5 = np.zeros(n_sigs, dtype=bool)
    headline_same_pair_95_4 = np.zeros(n_sigs, dtype=bool)
    headline_same_pair_95_3 = np.zeros(n_sigs, dtype=bool)

    # process batches of source signatures
    for batch_start in range(0, n_sigs, BATCH):
        batch_end = min(batch_start + BATCH, n_sigs)
        if batch_start % 5000 == 0:
            pct = batch_start / n_sigs * 100
            print(f' {batch_start:,}/{n_sigs:,} ({pct:.1f}%)')

        for si in range(batch_start, batch_end):
            s_cpa = cpas[si]
            n_pool = pool_sizes[s_cpa]
            pool_size_arr[si] = n_pool

            if n_pool <= 0:
                max_cos[si] = 0.0
                min_dh[si] = 64
                continue

            # Sample n_pool candidates from non-same-CPA indices
            same_cpa = cpa_to_idx[s_cpa]
            # Using random.choice over all_idx excluding same_cpa is slow;
            # instead reject-sample from all_idx
            need = n_pool
            cand_indices = []
            attempts = 0
            while need > 0 and attempts < 10:
                draw = rng.choice(n_sigs, size=need * 2, replace=True)
                # filter out same_cpa
                same_mask = np.isin(draw, same_cpa)
                ok = draw[~same_mask]
                cand_indices.extend(ok[:need].tolist())
                need -= len(ok[:need])
                attempts += 1
            if need > 0:
                # fallback: deterministic sample without same-CPA
                pool_mask = np.ones(n_sigs, dtype=bool)
                pool_mask[same_cpa] = False
                pool_idx = all_idx[pool_mask]
                fb = rng.choice(pool_idx, size=need, replace=False)
                cand_indices.extend(fb.tolist())
            cand_indices = np.array(cand_indices[:n_pool], dtype=np.int64)

            # Cosine: source feat @ cand feats
            cos_vec = feats[cand_indices] @ feats[si]
            # dHash
            dh_vec = hamming_vec(dhashes[si],
                                 [dhashes[c] for c in cand_indices])

            mc_idx = int(np.argmax(cos_vec))
            md_idx = int(np.argmin(dh_vec))
            max_cos[si] = float(cos_vec[mc_idx])
            min_dh[si] = int(dh_vec[md_idx])
            dh_at_max_cos[si] = int(dh_vec[mc_idx])
            cos_at_min_dh[si] = float(cos_vec[md_idx])

            # Same-pair joint indicators
            cos_gt = cos_vec > 0.95
            if cos_gt.any():
                dh_under_5 = dh_vec <= 5
                dh_under_4 = dh_vec <= 4
                dh_under_3 = dh_vec <= 3
                headline_same_pair_95_5[si] = bool((cos_gt & dh_under_5).any())
                headline_same_pair_95_4[si] = bool((cos_gt & dh_under_4).any())
                headline_same_pair_95_3[si] = bool((cos_gt & dh_under_3).any())

    print(' Done.')

    # ── Aggregate ──────────────────────────────────────────
    print('\nAggregating per-signature FAR statistics...')

    def far_marginal_cos(k):
        hits = int((max_cos > k).sum())
        lo, hi = wilson_ci(hits, n_sigs)
        return {'k': k, 'hits': hits, 'n': n_sigs,
                'far': hits / n_sigs,
                'ci95_lo': lo, 'ci95_hi': hi}

    def far_marginal_dh(k):
        hits = int((min_dh <= k).sum())
        lo, hi = wilson_ci(hits, n_sigs)
        return {'k': k, 'hits': hits, 'n': n_sigs,
                'far': hits / n_sigs,
                'ci95_lo': lo, 'ci95_hi': hi}

    def far_any_pair_joint(cos_k, dh_k):
        hits = int(((max_cos > cos_k) & (min_dh <= dh_k)).sum())
        lo, hi = wilson_ci(hits, n_sigs)
        return {'cos_k': cos_k, 'dh_k': dh_k,
                'hits': hits, 'n': n_sigs,
                'far': hits / n_sigs,
                'ci95_lo': lo, 'ci95_hi': hi}

    def far_same_pair_joint(cos_k, dh_k, indicator):
        hits = int(indicator.sum())
        lo, hi = wilson_ci(hits, n_sigs)
        return {'cos_k': cos_k, 'dh_k': dh_k,
                'hits': hits, 'n': n_sigs,
                'far': hits / n_sigs,
                'ci95_lo': lo, 'ci95_hi': hi}

    cos_curve = [far_marginal_cos(k) for k in COS_KS]
    dh_curve = [far_marginal_dh(k) for k in DH_KS]
    any_pair_curve = [far_any_pair_joint(0.95, k) for k in DH_KS]
    same_pair_curve = [
        far_same_pair_joint(0.95, 5, headline_same_pair_95_5),
        far_same_pair_joint(0.95, 4, headline_same_pair_95_4),
        far_same_pair_joint(0.95, 3, headline_same_pair_95_3),
    ]

    print('\n[Per-signature marginal cos FAR]')
    for e in cos_curve:
        print(f' max-cos > {e["k"]:.3f}: FAR={e["far"]:.4f}, '
              f'CI=[{e["ci95_lo"]:.4f}, {e["ci95_hi"]:.4f}], '
              f'hits={e["hits"]}/{e["n"]:,}')

    print('\n[Per-signature marginal dh FAR]')
    for e in dh_curve:
        print(f' min-dh <= {e["k"]:2d}: FAR={e["far"]:.4f}, '
              f'CI=[{e["ci95_lo"]:.4f}, {e["ci95_hi"]:.4f}], '
              f'hits={e["hits"]}/{e["n"]:,}')

    print('\n[Per-signature any-pair joint FAR (cos>0.95 AND dh<=k)]')
    for e in any_pair_curve:
        print(f' dh <= {e["dh_k"]:2d}: FAR={e["far"]:.4f}, '
              f'hits={e["hits"]}/{e["n"]:,}')

    print('\n[Per-signature SAME-pair joint FAR]')
    for e in same_pair_curve:
        print(f' cos>0.95 AND dh<={e["dh_k"]}: FAR={e["far"]:.4f}, '
              f'CI=[{e["ci95_lo"]:.4f}, {e["ci95_hi"]:.4f}], '
              f'hits={e["hits"]}/{e["n"]:,}')

    # Per-firm and per-pool-decile stratification
    print('\n[Per-firm headline FAR (any-pair, cos>0.95 AND dh<=5)]')
    per_firm = {}
    for f in sorted(set(firms)):
        mask = firms == f
        n_f = int(mask.sum())
        hits_anypair = int(((max_cos[mask] > 0.95) &
                            (min_dh[mask] <= 5)).sum())
        hits_samepair = int(headline_same_pair_95_5[mask].sum())
        per_firm[f] = {
            'n': n_f,
            'any_pair_far': hits_anypair / n_f,
            'same_pair_far': hits_samepair / n_f,
        }
        print(f' {f}: n={n_f:,} '
              f'any-pair FAR={hits_anypair/n_f:.4f}, '
              f'same-pair FAR={hits_samepair/n_f:.4f}')

    print('\n[Pool-size decile × headline FAR]')
    pool_arr = pool_size_arr
    deciles = np.percentile(pool_arr, np.arange(0, 110, 10))
    per_decile = {}
    for d in range(10):
        lo, hi = deciles[d], deciles[d + 1]
        mask = (pool_arr >= lo) & ((pool_arr <= hi) if d == 9
                                   else (pool_arr < hi))
        n_d = int(mask.sum())
        if n_d == 0:
            continue
        hits_any = int(((max_cos[mask] > 0.95) &
                        (min_dh[mask] <= 5)).sum())
        hits_same = int(headline_same_pair_95_5[mask].sum())
        per_decile[f'decile_{d+1}'] = {
            'pool_range': [float(lo), float(hi)],
            'n': n_d,
            'any_pair_far': hits_any / n_d,
            'same_pair_far': hits_same / n_d,
        }
        print(f' Decile {d+1} (pool {lo:.0f}-{hi:.0f}): n={n_d:,} '
              f'any-FAR={hits_any/n_d:.4f}, '
              f'same-FAR={hits_same/n_d:.4f}')

    # CPA bootstrap on headline (cos>0.95 AND dh<=5, same-pair)
    print(f'\n[CPA-block bootstrap {N_BOOT_CPA} replicates]')
    rng_b = np.random.default_rng(SEED + 1)
    all_cpa_list = list(cpa_to_idx.keys())
    boot_anypair = np.zeros(N_BOOT_CPA)
    boot_samepair = np.zeros(N_BOOT_CPA)
    for b in range(N_BOOT_CPA):
        cpas_b = rng_b.choice(all_cpa_list, size=len(all_cpa_list),
                              replace=True)
        idx_b = np.concatenate([cpa_to_idx[c] for c in cpas_b])
        n_b = len(idx_b)
        boot_anypair[b] = ((max_cos[idx_b] > 0.95) &
                           (min_dh[idx_b] <= 5)).mean()
        boot_samepair[b] = headline_same_pair_95_5[idx_b].mean()
    boot_anypair_ci = (float(np.percentile(boot_anypair, 2.5)),
                       float(np.percentile(boot_anypair, 97.5)))
    boot_samepair_ci = (float(np.percentile(boot_samepair, 2.5)),
                        float(np.percentile(boot_samepair, 97.5)))
    print(f' any-pair FAR boot mean={boot_anypair.mean():.4f}, '
          f'95% CI={boot_anypair_ci}')
    print(f' same-pair FAR boot mean={boot_samepair.mean():.4f}, '
          f'95% CI={boot_samepair_ci}')

    # Document-level aggregation: a document is flagged if any of its
    # signatures has max_cos > 0.95 AND min_dh <= 5 (the worst-case rule)
    print('\n[Document-level aggregation]')
    doc_idx = defaultdict(list)
    for i, pdf in enumerate(source_pdfs):
        doc_idx[pdf].append(i)
    n_docs = len(doc_idx)
    doc_anypair_flag = 0
    doc_samepair_flag = 0
    for pdf, idxs in doc_idx.items():
        idxs_a = np.array(idxs, dtype=np.int64)
        if ((max_cos[idxs_a] > 0.95) & (min_dh[idxs_a] <= 5)).any():
            doc_anypair_flag += 1
        if headline_same_pair_95_5[idxs_a].any():
            doc_samepair_flag += 1
    print(f' n_documents: {n_docs:,}')
    print(f' doc-level any-pair FAR (any sig flagged) = '
          f'{doc_anypair_flag/n_docs:.4f} ({doc_anypair_flag}/{n_docs})')
    print(f' doc-level same-pair FAR = '
          f'{doc_samepair_flag/n_docs:.4f} ({doc_samepair_flag}/{n_docs})')

    # Threshold inversion: find cos and dh thresholds that hit per-sig
    # FAR targets at the marginal level
    print('\n[Per-signature marginal threshold inversion]')
    inversions = {}
    for tgt in [0.10, 0.05, 0.02, 0.01, 0.005]:
        c_pick = None
        for e in cos_curve:
            if e['far'] <= tgt:
                c_pick = e
                break
        d_pick = None
        for e in dh_curve:
            if e['far'] <= tgt:
                d_pick = e
                break
        any_pick = None
        for e in any_pair_curve:
            if e['far'] <= tgt:
                any_pick = e
                break
        same_pick = None
        for e in same_pair_curve:
            if e['far'] <= tgt:
                same_pick = e
                break
        inversions[f'per_sig_far_<=_{tgt}'] = {
            'marginal_cos': c_pick, 'marginal_dh': d_pick,
            'any_pair_joint': any_pick, 'same_pair_joint': same_pick,
        }
        print(f' per-sig FAR <= {tgt}:')
        if c_pick:
            print(f'  marginal cos: cos > {c_pick["k"]} '
                  f'(FAR={c_pick["far"]:.4f})')
        if d_pick:
            print(f'  marginal dh: dh <= {d_pick["k"]} '
                  f'(FAR={d_pick["far"]:.4f})')
        if any_pick:
            print(f'  any-pair joint: dh <= {any_pick["dh_k"]} '
                  f'(FAR={any_pick["far"]:.4f})')
        if same_pick:
            print(f'  same-pair joint: dh <= {same_pick["dh_k"]} '
                  f'(FAR={same_pick["far"]:.4f})')

    results = {
        'meta': {
            'script': '43',
            'timestamp': datetime.now().isoformat(timespec='seconds'),
            'n_signatures': n_sigs,
            'n_cpas': n_cpas,
            'n_boot_cpa': N_BOOT_CPA,
            'seed': SEED,
            'note': ('Pool-normalized per-signature FAR. For each '
                     'source signature, simulate inter-CPA candidate '
                     'pool of size n_pool(s); compute deployed max-cos '
                     'and min-dh; aggregate per-signature FAR.'),
        },
        'marginal_cos_curve': cos_curve,
        'marginal_dh_curve': dh_curve,
        'any_pair_joint_curve': any_pair_curve,
        'same_pair_joint': same_pair_curve,
        'per_firm_headline': per_firm,
        'per_pool_decile_headline': per_decile,
        'cpa_bootstrap_headline': {
            'any_pair_mean': float(boot_anypair.mean()),
            'any_pair_ci95': boot_anypair_ci,
            'same_pair_mean': float(boot_samepair.mean()),
            'same_pair_ci95': boot_samepair_ci,
        },
        'document_level_headline': {
            'n_docs': n_docs,
            'any_pair_far': doc_anypair_flag / n_docs,
            'same_pair_far': doc_samepair_flag / n_docs,
        },
        'threshold_inversions': inversions,
    }

    json_path = OUT / 'pool_normalized_results.json'
    json_path.write_text(json.dumps(results, indent=2, ensure_ascii=False),
                         encoding='utf-8')
    print(f'\n[json] {json_path}')

    md = ['# Pool-Normalized Per-Signature FAR (Script 43)',
          '', f'Generated: {results["meta"]["timestamp"]}',
          (f'Big-4 source signatures: {n_sigs:,} across {n_cpas} CPAs; '
           f'pool-size median={int(np.median(list(pool_sizes.values())))}, '
           f'max={max(pool_sizes.values())}'),
          (f'CPA-block bootstrap: {N_BOOT_CPA} replicates. Per source '
           'signature, one realisation of n_pool(s)-sized random '
           'inter-CPA candidate pool.'),
          '',
          '## Headline (cos>0.95 AND dh<=5)',
          '',
          '| Variant | per-sig FAR | 95% Wilson CI | CPA-bootstrap 95% CI |',
          '|---|---|---|---|']
    md.append(f'| any-pair joint | '
              f'{((max_cos > 0.95) & (min_dh <= 5)).mean():.4f} | '
              f'see JSON | [{boot_anypair_ci[0]:.4f}, '
              f'{boot_anypair_ci[1]:.4f}] |')
    md.append(f'| same-pair joint | '
              f'{headline_same_pair_95_5.mean():.4f} | '
              f'see JSON | [{boot_samepair_ci[0]:.4f}, '
              f'{boot_samepair_ci[1]:.4f}] |')
    md += [
        '',
        '## Marginal cos FAR (per-signature)',
        '',
        '| max-cos > k | FAR | 95% CI | hits / n |',
        '|---|---|---|---|']
    for e in cos_curve:
        md.append(f'| {e["k"]} | {e["far"]:.4f} | '
                  f'[{e["ci95_lo"]:.4f}, {e["ci95_hi"]:.4f}] | '
                  f'{e["hits"]} / {e["n"]:,} |')
    md += ['', '## Marginal dh FAR (per-signature)', '',
           '| min-dh <= k | FAR | 95% CI | hits / n |',
           '|---|---|---|---|']
    for e in dh_curve:
        md.append(f'| {e["k"]} | {e["far"]:.4f} | '
                  f'[{e["ci95_lo"]:.4f}, {e["ci95_hi"]:.4f}] | '
                  f'{e["hits"]} / {e["n"]:,} |')
    md += ['',
           '## Any-pair joint FAR (cos>0.95 AND dh<=k)',
           '',
           '| dh <= k | FAR | hits / n |',
           '|---|---|---|']
    for e in any_pair_curve:
        md.append(f'| {e["dh_k"]} | {e["far"]:.4f} | '
                  f'{e["hits"]} / {e["n"]:,} |')
    md += ['',
           '## Same-pair joint FAR (one candidate satisfies both)',
           '',
           '| cos>0.95 AND dh<=k | FAR | 95% CI | hits / n |',
           '|---|---|---|---|']
    for e in same_pair_curve:
        md.append(f'| dh <= {e["dh_k"]} | {e["far"]:.4f} | '
                  f'[{e["ci95_lo"]:.4f}, {e["ci95_hi"]:.4f}] | '
                  f'{e["hits"]} / {e["n"]:,} |')
    md += ['', '## Per-firm headline', '',
           '| Firm | n | any-pair FAR | same-pair FAR |',
           '|---|---|---|---|']
    for f, s in per_firm.items():
        md.append(f'| {f} | {s["n"]:,} | {s["any_pair_far"]:.4f} | '
                  f'{s["same_pair_far"]:.4f} |')
    md += ['', '## Per-pool-decile headline', '',
           '| Decile | pool range | n | any-pair FAR | same-pair FAR |',
           '|---|---|---|---|---|']
    for k, s in per_decile.items():
        md.append(f'| {k} | {s["pool_range"][0]:.0f}-'
                  f'{s["pool_range"][1]:.0f} | {s["n"]:,} | '
                  f'{s["any_pair_far"]:.4f} | '
                  f'{s["same_pair_far"]:.4f} |')
    md += ['', '## Document-level',
           '',
           f'- n_documents: {n_docs:,}',
           f'- any-pair FAR (any sig flagged): '
           f'{doc_anypair_flag/n_docs:.4f} '
           f'({doc_anypair_flag}/{n_docs})',
           f'- same-pair FAR: {doc_samepair_flag/n_docs:.4f} '
           f'({doc_samepair_flag}/{n_docs})',
           '',
           '## Threshold inversion (per-signature FAR targets)',
           '',
           '| target | marginal cos | marginal dh | any-pair joint '
           '| same-pair joint |',
           '|---|---|---|---|---|']
    for tgt in [0.10, 0.05, 0.02, 0.01, 0.005]:
        inv = inversions[f'per_sig_far_<=_{tgt}']
        c = inv['marginal_cos']
        d = inv['marginal_dh']
        a = inv['any_pair_joint']
        s = inv['same_pair_joint']
        cs = (f'cos > {c["k"]} (FAR={c["far"]:.4f})'
              if c else 'unachievable')
        ds = (f'dh <= {d["k"]} (FAR={d["far"]:.4f})'
              if d else 'unachievable')
        as_ = (f'dh <= {a["dh_k"]} (FAR={a["far"]:.4f})'
               if a else 'unachievable')
        ss = (f'dh <= {s["dh_k"]} (FAR={s["far"]:.4f})'
              if s else 'unachievable')
        md.append(f'| {tgt} | {cs} | {ds} | {as_} | {ss} |')
    md.append('')

    md_path = OUT / 'pool_normalized_report.md'
    md_path.write_text('\n'.join(md), encoding='utf-8')
    print(f'[md ] {md_path}')


if __name__ == '__main__':
main()
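The pool-size scaling that motivates Script 43 (from its docstring) can be checked in isolation: under an idealised independence assumption, a per-pair FAR p over a candidate pool of size n compounds to roughly 1 - (1 - p)^n per signature. A minimal sketch, using the docstring's joint pair FAR (~0.00014) and Big-4 median pool (~280) purely as illustrative inputs:

```python
# Sketch of the 1 - (1 - pair_FAR)**n compounding from Script 43's
# docstring. Assumes independent candidate comparisons, which the
# script's simulation does NOT assume; numbers are illustrative only.

def per_signature_far(pair_far: float, n_pool: int) -> float:
    """Probability that at least one of n_pool independent candidate
    comparisons false-accepts, given per-pair FAR pair_far."""
    return 1.0 - (1.0 - pair_far) ** n_pool


# Several percent, orders of magnitude above the per-pair rate:
print(per_signature_far(0.00014, 280))
```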
@@ -0,0 +1,437 @@
#!/usr/bin/env python3
"""
Script 44: Firm-Matched-Pool Regression + Source × Candidate Firm Hit Matrix
=============================================================================
Codex round-31 critique: Script 43 showed Firm A per-signature FAR is
20.18% vs B/C/D 0.19-0.51%, but Codex's pool-size-only expectation
gives Firm A ~7%, B/C/D 6-9%. So Firm A excess is NOT pool-size
confounded -- there is real firm heterogeneity. The paper must
defend this against the reviewer attack "Firm A is high because of
pool size."

This script:
1. Logistic regression of per-signature hit (any-pair, cos>0.95
   AND dh<=5) on (firm dummies + log(pool_size)) to quantify the
   residual firm effect after pool-size adjustment.
2. Pool-size stratified per-firm FAR within common deciles, to
   verify the firm gap survives within matched pool sizes.
3. Source-firm × candidate-firm hit matrix: where do the false
   accepts originate? Same firm? Different firm? Big-4 vs non-Big-4
   candidates?

Loads Script 43's per-signature output via re-simulation (faster
than re-loading reports). One realisation per source signature,
seed=42 (matching Script 43).

Outputs:
  reports/v4_big4/firm_matched_pool/
    firm_matched_pool_results.json
    firm_matched_pool_report.md
"""

import json
import sqlite3
import numpy as np
from pathlib import Path
from datetime import datetime
from collections import defaultdict

DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
           'v4_big4/firm_matched_pool')
OUT.mkdir(parents=True, exist_ok=True)

BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
ALIAS = {'勤業眾信聯合': 'Firm A',
         '安侯建業聯合': 'Firm B',
         '資誠聯合': 'Firm C',
         '安永聯合': 'Firm D'}
SEED = 42


def hamming_vec(query_bytes, cand_bytes_list):
    q = int.from_bytes(query_bytes, 'big')
    out = np.empty(len(cand_bytes_list), dtype=np.int32)
    for i, c in enumerate(cand_bytes_list):
        out[i] = (q ^ int.from_bytes(c, 'big')).bit_count()
    return out


def load_all_signatures():
    """Load all signatures (Big-4 + non-Big-4) for cross-firm hit matrix."""
    conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
    cur = conn.cursor()
    cur.execute('''
        SELECT s.signature_id, s.assigned_accountant, a.firm,
               s.feature_vector, s.dhash_vector
        FROM signatures s
        JOIN accountants a ON s.assigned_accountant = a.name
        WHERE s.assigned_accountant IS NOT NULL
          AND s.feature_vector IS NOT NULL
          AND s.dhash_vector IS NOT NULL
          AND a.firm IS NOT NULL
    ''')
    rows = cur.fetchall()
    conn.close()
    return rows


def logistic_fit(X, y, max_iter=200, lr=0.3, l2=0.0):
    """Simple Newton-Raphson logistic regression. Returns betas, SEs."""
    n, k = X.shape
    beta = np.zeros(k)
    for it in range(max_iter):
        eta = X @ beta
        eta = np.clip(eta, -30, 30)
        p = 1.0 / (1.0 + np.exp(-eta))
        # Add l2 reg
        grad = X.T @ (y - p) - l2 * beta
        W = p * (1 - p)
        H = -(X.T * W) @ X - l2 * np.eye(k)
        try:
            delta = np.linalg.solve(H, grad)
        except np.linalg.LinAlgError:
            # Fallback: plain gradient-ascent step. Sign flipped so that
            # new_beta = beta - delta moves uphill on the log-likelihood.
            delta = -lr * grad
        new_beta = beta - delta
        if np.max(np.abs(new_beta - beta)) < 1e-8:
            beta = new_beta
            break
        beta = new_beta
    # Standard errors from inverse Fisher information
    eta = np.clip(X @ beta, -30, 30)
    p = 1.0 / (1.0 + np.exp(-eta))
    W = p * (1 - p)
    info = (X.T * W) @ X + l2 * np.eye(k)
    cov = np.linalg.inv(info)
    se = np.sqrt(np.diag(cov))
    return beta, se


def main():
    print('=' * 72)
    print('Script 44: Firm-Matched-Pool Regression + Cross-Firm Hit Matrix')
    print('=' * 72)
    rows = load_all_signatures()
    n_total = len(rows)
    print(f'\nLoaded {n_total:,} signatures (Big-4 + non-Big-4)')

    sig_ids = np.array([r[0] for r in rows], dtype=np.int64)
    cpas = np.array([r[1] for r in rows])
    firms_raw = np.array([r[2] for r in rows])
    firms = np.array([ALIAS.get(f, f) for f in firms_raw])
    is_big4 = np.isin(firms_raw, BIG4)
    print(f' Big-4 sigs: {is_big4.sum():,}; '
          f'non-Big-4 sigs: {(~is_big4).sum():,}')

    feats = np.stack([np.frombuffer(r[3], dtype=np.float32)
                      for r in rows]).astype(np.float32)
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    feats = feats / norms
    dhashes = [r[4] for r in rows]

    cpa_to_idx = defaultdict(list)
    for i, c in enumerate(cpas):
        cpa_to_idx[c].append(i)
    cpa_to_idx = {c: np.array(v, dtype=np.int64)
                  for c, v in cpa_to_idx.items()}
    pool_sizes = {c: len(v) - 1 for c, v in cpa_to_idx.items()}

    # ── Per-source-sig simulation for Big-4 sources (with candidates
    # drawn from ALL non-same-CPA, including non-Big-4 sigs) ──
    print('\nSimulating per-Big-4-source-signature inter-CPA pool '
          '(candidates from all non-same-CPA sigs)...')
    rng = np.random.default_rng(SEED)
    big4_idx = np.where(is_big4)[0]

    n_b = len(big4_idx)
    src_firm = np.empty(n_b, dtype=object)
    pool_size_arr = np.zeros(n_b, dtype=np.int32)
    hit_any_pair = np.zeros(n_b, dtype=bool)
    hit_same_pair = np.zeros(n_b, dtype=bool)
    # For each hit, record candidate firm and big4-or-not
    cand_firm_anypair_max_cos = np.empty(n_b, dtype=object)
    cand_firm_anypair_min_dh = np.empty(n_b, dtype=object)
    cand_firm_samepair = np.empty(n_b, dtype=object)

    for bi, si in enumerate(big4_idx):
        if bi % 5000 == 0:
            print(f' {bi:,}/{n_b:,} ({bi/n_b*100:.1f}%)')
        s_cpa = cpas[si]
        n_pool = pool_sizes[s_cpa]
        pool_size_arr[bi] = n_pool
        src_firm[bi] = firms[si]
        if n_pool <= 0:
            continue
        # Sample n_pool candidates from all non-same-CPA signatures
        same_cpa = cpa_to_idx[s_cpa]
        need = n_pool
        cand_indices = []
        attempts = 0
        while need > 0 and attempts < 10:
            draw = rng.choice(n_total, size=need * 2, replace=True)
            same_mask = np.isin(draw, same_cpa)
            ok = draw[~same_mask]
            cand_indices.extend(ok[:need].tolist())
            need -= len(ok[:need])
            attempts += 1
        if need > 0:
            pool_mask = np.ones(n_total, dtype=bool)
            pool_mask[same_cpa] = False
            pool_idx = np.where(pool_mask)[0]
            fb = rng.choice(pool_idx, size=need, replace=False)
            cand_indices.extend(fb.tolist())
        cand_indices = np.array(cand_indices[:n_pool], dtype=np.int64)

        cos_vec = feats[cand_indices] @ feats[si]
        dh_vec = hamming_vec(dhashes[si],
                             [dhashes[c] for c in cand_indices])

        mc_idx = int(np.argmax(cos_vec))
        md_idx = int(np.argmin(dh_vec))
        max_cos_v = float(cos_vec[mc_idx])
        min_dh_v = int(dh_vec[md_idx])

        cos_gt = max_cos_v > 0.95
        dh_le = min_dh_v <= 5
        if cos_gt and dh_le:
            hit_any_pair[bi] = True
            cand_firm_anypair_max_cos[bi] = firms[cand_indices[mc_idx]]
            cand_firm_anypair_min_dh[bi] = firms[cand_indices[md_idx]]
        # Same-pair indicator
        same_pair_mask = (cos_vec > 0.95) & (dh_vec <= 5)
        if same_pair_mask.any():
            hit_same_pair[bi] = True
            # pick first same-pair hit's firm
            first_idx = int(np.argmax(same_pair_mask))
            cand_firm_samepair[bi] = firms[cand_indices[first_idx]]

    print(' Done.')

    # ── Logistic regression: hit ~ firm + log(pool_size) ──
    print('\n[Logistic regression] hit (any-pair, cos>0.95 AND dh<=5) ~ '
          'firm + log(pool_size)')
    # Design matrix: intercept, firm B/C/D dummies (Firm A reference),
    # log(pool_size)
    has_pool = pool_size_arr > 0
    y = hit_any_pair[has_pool].astype(np.float64)
    f_arr = src_firm[has_pool]
    log_pool = np.log(pool_size_arr[has_pool].astype(np.float64))
    log_pool = (log_pool - log_pool.mean())  # centered for numerical stability
    intercept = np.ones(y.shape)
    is_B = (f_arr == 'Firm B').astype(np.float64)
    is_C = (f_arr == 'Firm C').astype(np.float64)
    is_D = (f_arr == 'Firm D').astype(np.float64)
    X_full = np.column_stack([intercept, is_B, is_C, is_D, log_pool])
    print(f' n={len(y):,}, y_mean={y.mean():.4f}')
    beta_full, se_full = logistic_fit(X_full, y, l2=0.001)
    names_full = ['intercept(FirmA)', 'FirmB', 'FirmC', 'FirmD',
                  'log(pool_size_centered)']
    print(' Full model:')
    for n, b, s in zip(names_full, beta_full, se_full):
        print(f'  {n}: beta={b:+.4f}, SE={s:.4f}, '
              f'OR=exp(beta)={np.exp(b):.4f}, '
              f'p~{abs(b)/s if s>0 else float("inf"):.2f}*SE')

    # Pool-only model (without firm dummies) for comparison
    X_pool = np.column_stack([intercept, log_pool])
    beta_pool, se_pool = logistic_fit(X_pool, y, l2=0.001)
    print(' Pool-only model (no firm dummies):')
    for n, b, s in zip(['intercept', 'log(pool_size_centered)'],
                       beta_pool, se_pool):
        print(f'  {n}: beta={b:+.4f}, SE={s:.4f}')

    # ── Pool-decile × firm hit rates ──
    print('\n[Pool-decile × firm hit rates]')
    deciles = np.percentile(pool_size_arr, np.arange(0, 110, 10))
    decile_firm = defaultdict(lambda: defaultdict(list))
    for bi in range(n_b):
        ps = pool_size_arr[bi]
        if ps <= 0:
            continue
        d = min(int(np.searchsorted(deciles, ps, side='right')) - 1, 9)
        decile_firm[d][src_firm[bi]].append(int(hit_any_pair[bi]))
    pool_decile_results = {}
    for d in range(10):
        firms_in_d = {}
        for f, hits in decile_firm[d].items():
            n_f = len(hits)
|
||||
if n_f == 0:
|
||||
continue
|
||||
far = float(np.mean(hits))
|
||||
firms_in_d[f] = {'n': n_f, 'far': far}
|
||||
pool_decile_results[f'decile_{d+1}'] = {
|
||||
'pool_range': [float(deciles[d]), float(deciles[d+1])],
|
||||
'per_firm': firms_in_d,
|
||||
}
|
||||
line = f' Decile {d+1} (pool {deciles[d]:.0f}-{deciles[d+1]:.0f}):'
|
||||
for f in ['Firm A', 'Firm B', 'Firm C', 'Firm D']:
|
||||
if f in firms_in_d:
|
||||
line += (f' {f}: {firms_in_d[f]["far"]:.4f} '
|
||||
f'(n={firms_in_d[f]["n"]})')
|
||||
print(line)
|
||||
|
||||
# ── Source-firm × candidate-firm hit matrix (any-pair) ──
|
||||
print('\n[Source-firm × candidate-firm hit matrix, max-cos pair]')
|
||||
src_list = ['Firm A', 'Firm B', 'Firm C', 'Firm D']
|
||||
cand_categories = ['Firm A', 'Firm B', 'Firm C', 'Firm D',
|
||||
'non-Big-4']
|
||||
matrix_max_cos = {s: {c: 0 for c in cand_categories}
|
||||
for s in src_list}
|
||||
matrix_min_dh = {s: {c: 0 for c in cand_categories}
|
||||
for s in src_list}
|
||||
matrix_samepair = {s: {c: 0 for c in cand_categories}
|
||||
for s in src_list}
|
||||
src_totals = {s: 0 for s in src_list}
|
||||
for bi in range(n_b):
|
||||
s_f = src_firm[bi]
|
||||
if s_f in src_list:
|
||||
src_totals[s_f] += 1
|
||||
if hit_any_pair[bi]:
|
||||
cf_max = cand_firm_anypair_max_cos[bi]
|
||||
cf_min = cand_firm_anypair_min_dh[bi]
|
||||
cat_max = cf_max if cf_max in src_list else 'non-Big-4'
|
||||
cat_min = cf_min if cf_min in src_list else 'non-Big-4'
|
||||
if s_f in matrix_max_cos:
|
||||
matrix_max_cos[s_f][cat_max] += 1
|
||||
matrix_min_dh[s_f][cat_min] += 1
|
||||
if hit_same_pair[bi]:
|
||||
cf = cand_firm_samepair[bi]
|
||||
cat = cf if cf in src_list else 'non-Big-4'
|
||||
if s_f in matrix_samepair:
|
||||
matrix_samepair[s_f][cat] += 1
|
||||
|
||||
print(' Max-cosine partner firm (count among hits):')
|
||||
print(f' {"Source":<10s} | {" Firm A":>9s} {" Firm B":>9s} '
|
||||
f'{" Firm C":>9s} {" Firm D":>9s} {"non-Big-4":>10s}'
|
||||
f' {"n_source":>10s}')
|
||||
for s in src_list:
|
||||
row = f' {s:<10s} |'
|
||||
for c in cand_categories:
|
||||
row += f' {matrix_max_cos[s][c]:>9d}'
|
||||
row += f' {src_totals[s]:>10d}'
|
||||
print(row)
|
||||
|
||||
print(' Min-dHash partner firm (count among any-pair hits):')
|
||||
for s in src_list:
|
||||
row = f' {s:<10s} |'
|
||||
for c in cand_categories:
|
||||
row += f' {matrix_min_dh[s][c]:>9d}'
|
||||
print(row)
|
||||
|
||||
print(' Same-pair joint hit, candidate firm:')
|
||||
for s in src_list:
|
||||
row = f' {s:<10s} |'
|
||||
for c in cand_categories:
|
||||
row += f' {matrix_samepair[s][c]:>9d}'
|
||||
print(row)
|
||||
|
||||
results = {
|
||||
'meta': {
|
||||
'script': '44',
|
||||
'timestamp': datetime.now().isoformat(timespec='seconds'),
|
||||
'n_big4_sources': n_b,
|
||||
'n_total_candidate_pool': n_total,
|
||||
'seed': SEED,
|
||||
'note': ('Firm-matched-pool regression + cross-firm hit '
|
||||
'matrix. Confirms Firm A excess is firm '
|
||||
'heterogeneity not pool-size confound.'),
|
||||
},
|
||||
'regression_full': {
|
||||
'feature_names': names_full,
|
||||
'beta': beta_full.tolist(),
|
||||
'se': se_full.tolist(),
|
||||
'odds_ratio': np.exp(beta_full).tolist(),
|
||||
},
|
||||
'regression_pool_only': {
|
||||
'feature_names': ['intercept',
|
||||
'log(pool_size_centered)'],
|
||||
'beta': beta_pool.tolist(),
|
||||
'se': se_pool.tolist(),
|
||||
},
|
||||
'pool_decile_per_firm': pool_decile_results,
|
||||
'cross_firm_hit_matrix': {
|
||||
'max_cos_partner': matrix_max_cos,
|
||||
'min_dh_partner': matrix_min_dh,
|
||||
'same_pair': matrix_samepair,
|
||||
'source_totals': src_totals,
|
||||
},
|
||||
}
|
||||
json_path = OUT / 'firm_matched_pool_results.json'
|
||||
json_path.write_text(json.dumps(results, indent=2, ensure_ascii=False),
|
||||
encoding='utf-8')
|
||||
print(f'\n[json] {json_path}')
|
||||
|
||||
# Markdown
|
||||
md = ['# Firm-Matched-Pool Regression + Cross-Firm Hit Matrix '
|
||||
'(Script 44)',
|
||||
'', f'Generated: {results["meta"]["timestamp"]}',
|
||||
f'n_big4_sources = {n_b:,}; '
|
||||
f'candidate pool drawn from {n_total:,} total signatures '
|
||||
'(any non-same-CPA).',
|
||||
'',
|
||||
'## Logistic regression: hit ~ firm + log(pool_size)',
|
||||
'',
|
||||
'Reference category: Firm A. log(pool_size) centred.',
|
||||
'Hit = any-pair joint (cos>0.95 AND dh<=5).',
|
||||
'',
|
||||
'| Term | beta | SE | OR=exp(beta) |',
|
||||
'|---|---|---|---|']
|
||||
for n, b, s in zip(names_full, beta_full, se_full):
|
||||
md.append(f'| {n} | {b:+.4f} | {s:.4f} | {np.exp(b):.4f} |')
|
||||
md += ['',
|
||||
('A large negative beta on FirmB/C/D dummies AFTER '
|
||||
'controlling for log(pool_size) is evidence that Firm A '
|
||||
"excess is firm heterogeneity, not pool-size confound."),
|
||||
'',
|
||||
'## Pool-decile × firm hit rates (any-pair)',
|
||||
'',
|
||||
'| Decile | Pool range | Firm A | Firm B | Firm C | Firm D |',
|
||||
'|---|---|---|---|---|---|']
|
||||
for d in range(10):
|
||||
key = f'decile_{d+1}'
|
||||
r = pool_decile_results.get(key, {})
|
||||
pf = r.get('per_firm', {})
|
||||
lo, hi = r.get('pool_range', [0, 0])
|
||||
row_cells = [
|
||||
f'{pf[f]["far"]:.4f} (n={pf[f]["n"]})' if f in pf else '—'
|
||||
for f in src_list
|
||||
]
|
||||
md.append(f'| {d+1} | {lo:.0f}-{hi:.0f} | '
|
||||
f'{row_cells[0]} | {row_cells[1]} | '
|
||||
f'{row_cells[2]} | {row_cells[3]} |')
|
||||
md += ['',
|
||||
'## Cross-firm hit matrix (any-pair, max-cosine partner)',
|
||||
'',
|
||||
'| Source firm | A | B | C | D | non-Big-4 | n_source |',
|
||||
'|---|---|---|---|---|---|---|']
|
||||
for s in src_list:
|
||||
row = matrix_max_cos[s]
|
||||
md.append(f'| {s} | {row["Firm A"]} | {row["Firm B"]} | '
|
||||
f'{row["Firm C"]} | {row["Firm D"]} | '
|
||||
f'{row["non-Big-4"]} | {src_totals[s]} |')
|
||||
md += ['', '## Same-pair joint hit, candidate firm', '',
|
||||
'| Source firm | A | B | C | D | non-Big-4 |',
|
||||
'|---|---|---|---|---|---|']
|
||||
for s in src_list:
|
||||
row = matrix_samepair[s]
|
||||
md.append(f'| {s} | {row["Firm A"]} | {row["Firm B"]} | '
|
||||
f'{row["Firm C"]} | {row["Firm D"]} | '
|
||||
f'{row["non-Big-4"]} |')
|
||||
md += ['',
|
||||
'## Interpretation',
|
||||
'',
|
||||
('If max-cosine partners of Firm A source signatures are '
|
||||
'disproportionately drawn from Firm A or from non-Big-4 '
|
||||
'firms (where templates are widely shared), the Firm A '
|
||||
'collision excess reflects an image-manifold property '
|
||||
'rather than a Firm-A-specific replication mechanism. '
|
||||
'The paper interpretation must reflect this carefully.'),
|
||||
'']
|
||||
md_path = OUT / 'firm_matched_pool_report.md'
|
||||
md_path.write_text('\n'.join(md), encoding='utf-8')
|
||||
print(f'[md ] {md_path}')
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
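A point worth keeping straight when reading Script 44's hit logic: the "any-pair" rule takes the max cosine and the min dHash distance over possibly *different* candidates, while the "same-pair" rule requires a single candidate to clear both thresholds, so same-pair hits are a subset of any-pair hits. A minimal standalone sketch (not part of the script; toy values, inherited HC thresholds 0.95 / 5):

```python
import numpy as np

def any_pair_hit(cos_vec, dh_vec, cos_k=0.95, dh_k=5):
    # max-cos and min-dh may come from different candidates
    return bool(cos_vec.max() > cos_k and dh_vec.min() <= dh_k)

def same_pair_hit(cos_vec, dh_vec, cos_k=0.95, dh_k=5):
    # one candidate must clear both thresholds simultaneously
    return bool(((cos_vec > cos_k) & (dh_vec <= dh_k)).any())

# Candidate 0: high cosine but far in dHash; candidate 1: close in
# dHash but low cosine. Any-pair fires, same-pair does not.
cos_vec = np.array([0.97, 0.60])
dh_vec = np.array([20, 3])
print(any_pair_hit(cos_vec, dh_vec))   # True
print(same_pair_hit(cos_vec, dh_vec))  # False
```

This is why the script tracks `hit_any_pair` and `hit_same_pair` separately: the gap between the two rates measures how often the joint rule fires on mismatched partners.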
@@ -0,0 +1,365 @@
#!/usr/bin/env python3
"""
Script 45: Full 5-Way Document-Level FAR (HC / HC+MC / HC+MC+HSC)
==================================================================

Codex round-31 noted: Script 43 reports HC-only document-level FAR
(17.97% any-pair). The actual deployed five-way classifier treats
the MC band (cos>0.95 AND 5<dh<=15) as "non-hand-signed" too, with
worst-case document-level priority HC > MC > HSC > UN > LH. The
paper must report doc-level FAR for each alarm definition.

This script reuses Script 43's per-signature simulation but tracks
the full five-way category each source signature would receive
under the random-inter-CPA pool, then aggregates to document level
under three alarm definitions:
    D1: HC only
    D2: HC + MC
    D3: HC + MC + HSC ("any non-hand-signed verdict")

For each definition we report:
    - Per-signature FAR (fraction of source sigs that fall into the
      alarm category against the random pool)
    - Document-level FAR (any sig in doc triggers alarm)

The five-way rule used (inherited from v3.20.0 §III-K):
    HC : cos > 0.95 AND dh <= 5
    MC : cos > 0.95 AND 5 < dh <= 15
    HSC: cos > 0.95 AND dh > 15
    UN : 0.837 < cos <= 0.95
    LH : cos <= 0.837

We compute these on the realised (max_cos, min_dh) pair (any-pair
semantics, which matches the deployed v3/v4 rule per codex).

Outputs:
    reports/v4_big4/doc_level_far_full/
        doc_far_full_results.json
        doc_far_full_report.md
"""

import json
import sqlite3
import numpy as np
from pathlib import Path
from datetime import datetime
from collections import defaultdict

DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
           'v4_big4/doc_level_far_full')
OUT.mkdir(parents=True, exist_ok=True)

BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
ALIAS = {'勤業眾信聯合': 'Firm A',
         '安侯建業聯合': 'Firm B',
         '資誠聯合': 'Firm C',
         '安永聯合': 'Firm D'}
SEED = 42

COS_HIGH = 0.95
COS_LOW = 0.837
DH_HC = 5
DH_MC_UPPER = 15


def hamming_vec(query_bytes, cand_bytes_list):
    q = int.from_bytes(query_bytes, 'big')
    out = np.empty(len(cand_bytes_list), dtype=np.int32)
    for i, c in enumerate(cand_bytes_list):
        out[i] = (q ^ int.from_bytes(c, 'big')).bit_count()
    return out


def classify_five_way(max_cos, min_dh):
    if max_cos > COS_HIGH and min_dh <= DH_HC:
        return 'HC'
    if max_cos > COS_HIGH and DH_HC < min_dh <= DH_MC_UPPER:
        return 'MC'
    if max_cos > COS_HIGH and min_dh > DH_MC_UPPER:
        return 'HSC'
    if COS_LOW < max_cos <= COS_HIGH:
        return 'UN'
    return 'LH'


def wilson_ci(k, n, z=1.96):
    if n == 0:
        return (None, None)
    phat = k / n
    denom = 1 + z * z / n
    centre = (phat + z * z / (2 * n)) / denom
    half = z * np.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def load_big4():
    conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
    cur = conn.cursor()
    cur.execute('''
        SELECT s.signature_id, s.assigned_accountant, a.firm,
               s.source_pdf,
               s.feature_vector, s.dhash_vector
        FROM signatures s
        JOIN accountants a ON s.assigned_accountant = a.name
        WHERE s.assigned_accountant IS NOT NULL
          AND s.feature_vector IS NOT NULL
          AND s.dhash_vector IS NOT NULL
          AND a.firm IN (?, ?, ?, ?)
    ''', BIG4)
    rows = cur.fetchall()
    conn.close()
    return rows


def main():
    print('=' * 72)
    print('Script 45: Full 5-Way Doc-Level FAR (HC / HC+MC / HC+MC+HSC)')
    print('=' * 72)
    rows = load_big4()
    n_sigs = len(rows)
    print(f'\nLoaded {n_sigs:,} Big-4 signatures')

    cpas = np.array([r[1] for r in rows])
    firms = np.array([ALIAS[r[2]] for r in rows])
    source_pdfs = np.array([r[3] for r in rows])
    feats = np.stack([np.frombuffer(r[4], dtype=np.float32)
                      for r in rows]).astype(np.float32)
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    feats = feats / norms
    dhashes = [r[5] for r in rows]

    cpa_to_idx = defaultdict(list)
    for i, c in enumerate(cpas):
        cpa_to_idx[c].append(i)
    cpa_to_idx = {c: np.array(v, dtype=np.int64)
                  for c, v in cpa_to_idx.items()}
    pool_sizes = {c: len(v) - 1 for c, v in cpa_to_idx.items()}
    all_idx = np.arange(n_sigs, dtype=np.int64)

    rng = np.random.default_rng(SEED)
    print('\nSimulating per-signature category under random inter-CPA pool...')
    categories = np.empty(n_sigs, dtype=object)
    max_cos_arr = np.zeros(n_sigs, dtype=np.float32)
    min_dh_arr = np.zeros(n_sigs, dtype=np.int32)
    for si in range(n_sigs):
        if si % 5000 == 0:
            print(f'  {si:,}/{n_sigs:,} ({si/n_sigs*100:.1f}%)')
        s_cpa = cpas[si]
        n_pool = pool_sizes[s_cpa]
        if n_pool <= 0:
            categories[si] = 'LH'
            continue
        same_cpa = cpa_to_idx[s_cpa]
        need = n_pool
        cand_indices = []
        attempts = 0
        while need > 0 and attempts < 10:
            draw = rng.choice(n_sigs, size=need * 2, replace=True)
            same_mask = np.isin(draw, same_cpa)
            ok = draw[~same_mask]
            cand_indices.extend(ok[:need].tolist())
            need -= len(ok[:need])
            attempts += 1
        if need > 0:
            pool_mask = np.ones(n_sigs, dtype=bool)
            pool_mask[same_cpa] = False
            pool_idx = all_idx[pool_mask]
            fb = rng.choice(pool_idx, size=need, replace=False)
            cand_indices.extend(fb.tolist())
        cand_indices = np.array(cand_indices[:n_pool], dtype=np.int64)
        cos_vec = feats[cand_indices] @ feats[si]
        dh_vec = hamming_vec(dhashes[si],
                             [dhashes[c] for c in cand_indices])
        max_cos = float(cos_vec.max())
        min_dh = int(dh_vec.min())
        max_cos_arr[si] = max_cos
        min_dh_arr[si] = min_dh
        categories[si] = classify_five_way(max_cos, min_dh)

    print('  Done.')

    # Per-signature FAR by category
    print('\n[Per-signature FAR by 5-way category]')
    cat_counts = defaultdict(int)
    for c in categories:
        cat_counts[c] += 1
    for cat in ['HC', 'MC', 'HSC', 'UN', 'LH']:
        n_c = cat_counts[cat]
        far = n_c / n_sigs
        lo, hi = wilson_ci(n_c, n_sigs)
        print(f'  {cat}: n={n_c:,}, FAR={far:.4f}, '
              f'CI=[{lo:.4f}, {hi:.4f}]')

    # Per-signature FAR under three alarm definitions
    print('\n[Per-signature FAR under alarm definitions]')
    alarm_d1 = (categories == 'HC')
    alarm_d2 = np.isin(categories, ['HC', 'MC'])
    alarm_d3 = np.isin(categories, ['HC', 'MC', 'HSC'])
    persig_fars = {
        'D1_HC_only': {
            'far': float(alarm_d1.mean()),
            'hits': int(alarm_d1.sum()),
            'n': int(n_sigs),
            'ci95': wilson_ci(int(alarm_d1.sum()), n_sigs),
        },
        'D2_HC_plus_MC': {
            'far': float(alarm_d2.mean()),
            'hits': int(alarm_d2.sum()),
            'n': int(n_sigs),
            'ci95': wilson_ci(int(alarm_d2.sum()), n_sigs),
        },
        'D3_HC_plus_MC_plus_HSC': {
            'far': float(alarm_d3.mean()),
            'hits': int(alarm_d3.sum()),
            'n': int(n_sigs),
            'ci95': wilson_ci(int(alarm_d3.sum()), n_sigs),
        },
    }
    for k, v in persig_fars.items():
        print(f'  {k}: FAR={v["far"]:.4f}, '
              f'CI=[{v["ci95"][0]:.4f}, {v["ci95"][1]:.4f}], '
              f'{v["hits"]:,}/{v["n"]:,}')

    # Document-level FAR under three alarm definitions
    print('\n[Document-level FAR under alarm definitions]')
    doc_idx = defaultdict(list)
    for i, pdf in enumerate(source_pdfs):
        doc_idx[pdf].append(i)
    n_docs = len(doc_idx)
    doc_d1 = 0
    doc_d2 = 0
    doc_d3 = 0
    for pdf, idxs in doc_idx.items():
        idxs_a = np.array(idxs, dtype=np.int64)
        if alarm_d1[idxs_a].any():
            doc_d1 += 1
        if alarm_d2[idxs_a].any():
            doc_d2 += 1
        if alarm_d3[idxs_a].any():
            doc_d3 += 1
    print(f'  n_documents: {n_docs:,}')
    print(f'  D1 (HC only):   FAR={doc_d1/n_docs:.4f} '
          f'({doc_d1:,}/{n_docs:,})')
    print(f'  D2 (HC+MC):     FAR={doc_d2/n_docs:.4f} '
          f'({doc_d2:,}/{n_docs:,})')
    print(f'  D3 (HC+MC+HSC): FAR={doc_d3/n_docs:.4f} '
          f'({doc_d3:,}/{n_docs:,})')

    # Per-firm doc-level FAR (D2 = HC+MC is the operational alarm)
    print('\n[Per-firm doc-level FAR (D1/D2/D3)]')
    # Map each doc to its dominant firm (mode of its signatures' firms)
    doc_firm = {}
    for pdf, idxs in doc_idx.items():
        fs = firms[idxs]
        vals, counts = np.unique(fs, return_counts=True)
        doc_firm[pdf] = str(vals[np.argmax(counts)])
    per_firm_doc = {}
    for f in ['Firm A', 'Firm B', 'Firm C', 'Firm D']:
        pdfs_f = [pdf for pdf, fr in doc_firm.items() if fr == f]
        n_f = len(pdfs_f)
        if n_f == 0:
            continue
        d1_h = sum(1 for pdf in pdfs_f
                   if alarm_d1[np.array(doc_idx[pdf])].any())
        d2_h = sum(1 for pdf in pdfs_f
                   if alarm_d2[np.array(doc_idx[pdf])].any())
        d3_h = sum(1 for pdf in pdfs_f
                   if alarm_d3[np.array(doc_idx[pdf])].any())
        per_firm_doc[f] = {
            'n_docs': n_f,
            'D1_HC': d1_h / n_f,
            'D2_HC_MC': d2_h / n_f,
            'D3_HC_MC_HSC': d3_h / n_f,
        }
        print(f'  {f} (n={n_f:,}): D1={d1_h/n_f:.4f}, '
              f'D2={d2_h/n_f:.4f}, D3={d3_h/n_f:.4f}')

    results = {
        'meta': {
            'script': '45',
            'timestamp': datetime.now().isoformat(timespec='seconds'),
            'n_signatures': n_sigs,
            'n_documents': n_docs,
            'seed': SEED,
            'note': ('Full 5-way doc-level FAR under three alarm '
                     'definitions, with per-firm stratification.'),
        },
        'persig_category_counts': dict(cat_counts),
        'persig_far_by_alarm': persig_fars,
        'doc_far_by_alarm': {
            'D1_HC_only': doc_d1 / n_docs,
            'D2_HC_plus_MC': doc_d2 / n_docs,
            'D3_HC_plus_MC_plus_HSC': doc_d3 / n_docs,
            'n_docs': n_docs,
            'hits': {'D1': doc_d1, 'D2': doc_d2, 'D3': doc_d3},
        },
        'per_firm_doc_far': per_firm_doc,
    }
    json_path = OUT / 'doc_far_full_results.json'
    json_path.write_text(json.dumps(results, indent=2, ensure_ascii=False),
                         encoding='utf-8')
    print(f'\n[json] {json_path}')

    md = [
        '# Full 5-Way Doc-Level FAR (Script 45)',
        '', f'Generated: {results["meta"]["timestamp"]}',
        f'Big-4 signatures: {n_sigs:,}; documents: {n_docs:,}',
        '',
        ('Per signature, simulate a random inter-CPA candidate pool of '
         'size n_pool, compute deployed (max-cos, min-dh), assign 5-way '
         'category, then aggregate to document level under three alarm '
         'definitions.'),
        '',
        '## 5-Way category distribution under random inter-CPA pool',
        '',
        '| Category | n | % |',
        '|---|---|---|',
    ]
    for cat in ['HC', 'MC', 'HSC', 'UN', 'LH']:
        n_c = cat_counts[cat]
        md.append(f'| {cat} | {n_c:,} | {n_c/n_sigs:.4f} |')
    md += ['',
           '## Per-signature FAR by alarm definition',
           '',
           '| Definition | rule | FAR | 95% CI | hits / n |',
           '|---|---|---|---|---|',
           f'| D1 | HC only | {persig_fars["D1_HC_only"]["far"]:.4f} | '
           f'[{persig_fars["D1_HC_only"]["ci95"][0]:.4f}, '
           f'{persig_fars["D1_HC_only"]["ci95"][1]:.4f}] | '
           f'{persig_fars["D1_HC_only"]["hits"]:,} / {n_sigs:,} |',
           f'| D2 | HC + MC | {persig_fars["D2_HC_plus_MC"]["far"]:.4f} | '
           f'[{persig_fars["D2_HC_plus_MC"]["ci95"][0]:.4f}, '
           f'{persig_fars["D2_HC_plus_MC"]["ci95"][1]:.4f}] | '
           f'{persig_fars["D2_HC_plus_MC"]["hits"]:,} / {n_sigs:,} |',
           f'| D3 | HC + MC + HSC | '
           f'{persig_fars["D3_HC_plus_MC_plus_HSC"]["far"]:.4f} | '
           f'[{persig_fars["D3_HC_plus_MC_plus_HSC"]["ci95"][0]:.4f}, '
           f'{persig_fars["D3_HC_plus_MC_plus_HSC"]["ci95"][1]:.4f}] | '
           f'{persig_fars["D3_HC_plus_MC_plus_HSC"]["hits"]:,} / {n_sigs:,} |',
           '',
           '## Document-level FAR by alarm definition',
           '',
           '| Definition | rule | FAR | hits / n_docs |',
           '|---|---|---|---|',
           f'| D1 | any sig HC | {doc_d1/n_docs:.4f} | {doc_d1:,} / {n_docs:,} |',
           f'| D2 | any sig HC or MC | {doc_d2/n_docs:.4f} | '
           f'{doc_d2:,} / {n_docs:,} |',
           f'| D3 | any sig HC, MC, or HSC | {doc_d3/n_docs:.4f} | '
           f'{doc_d3:,} / {n_docs:,} |',
           '',
           '## Per-firm doc-level FAR',
           '',
           '| Firm | n_docs | D1 (HC) | D2 (HC+MC) | D3 (HC+MC+HSC) |',
           '|---|---|---|---|---|']
    for f, s in per_firm_doc.items():
        md.append(f'| {f} | {s["n_docs"]:,} | {s["D1_HC"]:.4f} | '
                  f'{s["D2_HC_MC"]:.4f} | {s["D3_HC_MC_HSC"]:.4f} |')
    md.append('')
    md_path = OUT / 'doc_far_full_report.md'
    md_path.write_text('\n'.join(md), encoding='utf-8')
    print(f'[md  ] {md_path}')


if __name__ == '__main__':
    main()
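The five-way banding in Script 45 is a partition of the (max_cos, min_dh) plane, so boundary behaviour matters: dh = 5 is still HC (the bound is inclusive) and cos = 0.95 exactly falls into UN, not any cos-high band. A self-contained boundary check mirroring the script's `classify_five_way` (same thresholds, passed here as defaults so the sketch runs on its own):

```python
def classify_five_way(max_cos, min_dh,
                      cos_high=0.95, cos_low=0.837,
                      dh_hc=5, dh_mc_upper=15):
    # Same decision order as Script 45: cos-high bands first, then UN/LH.
    if max_cos > cos_high and min_dh <= dh_hc:
        return 'HC'
    if max_cos > cos_high and dh_hc < min_dh <= dh_mc_upper:
        return 'MC'
    if max_cos > cos_high and min_dh > dh_mc_upper:
        return 'HSC'
    if cos_low < max_cos <= cos_high:
        return 'UN'
    return 'LH'

# Boundary cases: dh=5 inclusive for HC, dh=15 inclusive for MC,
# cos=0.95 exactly is UN (strict '>' on the cos-high test).
for mc, mdh in [(0.96, 5), (0.96, 6), (0.96, 15), (0.96, 16),
                (0.95, 0), (0.90, 0), (0.80, 0)]:
    print(mc, mdh, classify_five_way(mc, mdh))
```

Because the three cos-high branches share the strict `max_cos > cos_high` test, every (max_cos, min_dh) pair lands in exactly one band, which is what lets the script count categories without overlap.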
@@ -0,0 +1,385 @@
#!/usr/bin/env python3
"""
Script 46: Alert-Rate Sensitivity / Threshold-Plateau Analysis
==============================================================

Anchor-based screening framework supplementary validation. With no
ground-truth labels, "threshold validation" can only be done via
proxies. One proxy: alert-rate sensitivity to threshold perturbation.

If the v3-inherited threshold (cos>0.95 AND dh<=5) sits at a
low-gradient region of the (cos, dh) -> alert-rate surface, that is
weak evidence the threshold is a stable operating point. If the
surface is everywhere smooth with no plateau, the threshold is an
arbitrary point in a continuous specificity-recall tradeoff -- which
is consistent with the "no natural threshold" finding from Scripts
39b-39e (composition decomposition) and supports the multi-level
screening framework framing.

This script computes alert rates (using actual observed Big-4
descriptors, NOT inter-CPA simulated pools) across:
    - 1D cos threshold sweep at fixed dh<=5
    - 1D dh threshold sweep at fixed cos>0.95
    - 2D (cos, dh) grid
Per firm and pooled. Gradient-based plateau detection.

Note: this uses observed (max_cos, min_dh) from each Big-4 signature's
real same-CPA pool, i.e., the deployment-side behavior of the rule
on the actual corpus (not the inter-CPA negative anchor).

Outputs:
    reports/v4_big4/alert_rate_sensitivity/
        alert_rate_results.json
        alert_rate_report.md
"""

import json
import sqlite3
import numpy as np
from pathlib import Path
from datetime import datetime
from collections import defaultdict

DB = '/Volumes/NV2/PDF-Processing/signature-analysis/signature_analysis.db'
OUT = Path('/Volumes/NV2/PDF-Processing/signature-analysis/reports/'
           'v4_big4/alert_rate_sensitivity')
OUT.mkdir(parents=True, exist_ok=True)

BIG4 = ('勤業眾信聯合', '安侯建業聯合', '資誠聯合', '安永聯合')
ALIAS = {'勤業眾信聯合': 'Firm A',
         '安侯建業聯合': 'Firm B',
         '資誠聯合': 'Firm C',
         '安永聯合': 'Firm D'}

# Threshold grids (np.arange excludes the stop value)
COS_GRID = np.arange(0.80, 1.00, 0.005)   # 40 points
DH_GRID = np.arange(0, 21, 1)             # 21 integer points
COS_FOR_2D = np.arange(0.85, 1.00, 0.01)  # 15 cos points for 2D
DH_FOR_2D = np.arange(0, 21, 1)           # 21 dh points for 2D


def load_big4():
    conn = sqlite3.connect(f'file:{DB}?mode=ro', uri=True)
    cur = conn.cursor()
    cur.execute('''
        SELECT s.assigned_accountant, a.firm,
               s.source_pdf,
               s.max_similarity_to_same_accountant,
               CAST(s.min_dhash_independent AS REAL)
        FROM signatures s
        JOIN accountants a ON s.assigned_accountant = a.name
        WHERE s.assigned_accountant IS NOT NULL
          AND s.max_similarity_to_same_accountant IS NOT NULL
          AND s.min_dhash_independent IS NOT NULL
          AND a.firm IN (?, ?, ?, ?)
    ''', BIG4)
    rows = cur.fetchall()
    conn.close()
    return rows


def alert_rate(cos_arr, dh_arr, cos_k, dh_k):
    """Fraction of (cos, dh) pairs satisfying cos>cos_k AND dh<=dh_k."""
    n = len(cos_arr)
    if n == 0:
        return 0.0
    return float(((cos_arr > cos_k) & (dh_arr <= dh_k)).mean())


def plateau_gradient(grid, rates):
    """Return absolute gradients |d(rate)/d(threshold)| between adjacent
    grid points, plus min, median, and max gradient."""
    rates = np.asarray(rates)
    grads = np.abs(np.diff(rates) / np.diff(grid))
    return {
        'gradients': grads.tolist(),
        'min': float(grads.min()) if len(grads) else None,
        'median': float(np.median(grads)) if len(grads) else None,
        'max': float(grads.max()) if len(grads) else None,
        'argmin_threshold': float(grid[int(np.argmin(grads))])
        if len(grads) else None,
    }


def main():
    print('=' * 72)
    print('Script 46: Alert-Rate Sensitivity / Threshold-Plateau Analysis')
    print('=' * 72)
    rows = load_big4()
    n_sigs = len(rows)
    print(f'\nLoaded {n_sigs:,} Big-4 signatures')

    firms = np.array([ALIAS[r[1]] for r in rows])
    source_pdfs = np.array([r[2] for r in rows])
    cos = np.array([r[3] for r in rows], dtype=np.float32)
    dh = np.array([r[4] for r in rows], dtype=np.int32)

    # Document grouping
    doc_idx = defaultdict(list)
    for i, pdf in enumerate(source_pdfs):
        doc_idx[pdf].append(i)
    n_docs = len(doc_idx)
    print(f'  Documents: {n_docs:,}')

    # Per-document worst-case (max cos, min dh)
    def doc_alert_rate(cos_k, dh_k):
        """Fraction of docs with any signature satisfying the rule."""
        hit_docs = 0
        for pdf, idxs in doc_idx.items():
            idxs_a = np.array(idxs, dtype=np.int64)
            if ((cos[idxs_a] > cos_k) & (dh[idxs_a] <= dh_k)).any():
                hit_docs += 1
        return hit_docs / n_docs

    results = {
        'meta': {
            'script': '46',
            'timestamp': datetime.now().isoformat(timespec='seconds'),
            'n_signatures': n_sigs,
            'n_documents': n_docs,
            'note': ('Alert-rate sensitivity using observed descriptors '
                     '(not inter-CPA simulation). Per-signature and '
                     'per-document; pooled and per-firm.'),
        },
    }

    # ── 1D cos sweep at fixed dh<=5 ──
    print('\n[1D cos sweep at dh<=5]')
    sig_rates_cos = {}
    sig_rates_cos['pooled'] = [alert_rate(cos, dh, k, 5) for k in COS_GRID]
    for f in sorted(set(firms)):
        mask = firms == f
        sig_rates_cos[f] = [alert_rate(cos[mask], dh[mask], k, 5)
                            for k in COS_GRID]
    print('   cos   | pooled | Firm A | Firm B | Firm C | Firm D')
    for i, k in enumerate(COS_GRID):
        if i % 4 == 0 or abs(k - 0.95) < 1e-6:
            line = f'  {k:.3f} | {sig_rates_cos["pooled"][i]:.4f}'
            for f in ['Firm A', 'Firm B', 'Firm C', 'Firm D']:
                line += f' | {sig_rates_cos[f][i]:.4f}'
            print(line)

    cos_pooled_grad = plateau_gradient(COS_GRID, sig_rates_cos['pooled'])
    print(f'\n  pooled gradient summary: min={cos_pooled_grad["min"]:.5f}, '
          f'median={cos_pooled_grad["median"]:.5f}, '
          f'max={cos_pooled_grad["max"]:.5f}')
    print(f'  argmin of |grad| at cos={cos_pooled_grad["argmin_threshold"]:.3f}')

    # ── 1D dh sweep at fixed cos>0.95 ──
    print('\n[1D dh sweep at cos>0.95]')
    sig_rates_dh = {}
    sig_rates_dh['pooled'] = [alert_rate(cos, dh, 0.95, k) for k in DH_GRID]
    for f in sorted(set(firms)):
        mask = firms == f
        sig_rates_dh[f] = [alert_rate(cos[mask], dh[mask], 0.95, k)
                           for k in DH_GRID]
    print('   dh | pooled | Firm A | Firm B | Firm C | Firm D')
    for i, k in enumerate(DH_GRID):
        line = f'  {k:2d} | {sig_rates_dh["pooled"][i]:.4f}'
        for f in ['Firm A', 'Firm B', 'Firm C', 'Firm D']:
            line += f' | {sig_rates_dh[f][i]:.4f}'
        print(line)

    dh_pooled_grad = plateau_gradient(DH_GRID, sig_rates_dh['pooled'])
    print(f'\n  pooled gradient summary: min={dh_pooled_grad["min"]:.5f}, '
          f'median={dh_pooled_grad["median"]:.5f}, '
          f'max={dh_pooled_grad["max"]:.5f}')
    print(f'  argmin of |grad| at dh={dh_pooled_grad["argmin_threshold"]:.0f}')

    # ── 2D (cos, dh) surface ──
    print('\n[2D (cos, dh) alert-rate surface]')
    surface = np.zeros((len(COS_FOR_2D), len(DH_FOR_2D)), dtype=np.float32)
    for i, ck in enumerate(COS_FOR_2D):
        for j, dk in enumerate(DH_FOR_2D):
            surface[i, j] = alert_rate(cos, dh, ck, dk)
    print('  Surface dimensions:', surface.shape)
    # Print a few key rows
    for i, ck in enumerate(COS_FOR_2D):
        if abs(ck - 0.85) < 1e-6 or abs(ck - 0.90) < 1e-6 \
                or abs(ck - 0.95) < 1e-6 or abs(ck - 0.98) < 1e-6:
            line = f'  cos>{ck:.2f}:'
            for j, dk in enumerate(DH_FOR_2D):
                if dk in [0, 3, 5, 8, 10, 15, 20]:
                    line += f' dh<={dk}: {surface[i, j]:.4f},'
            print(line)

    # 2D gradient magnitude at the key threshold (cos=0.95, dh=5) via
    # central differences; both indices are interior for these grids, so
    # the None branch only guards degenerate grid choices.
    i95 = int(np.argmin(np.abs(COS_FOR_2D - 0.95)))
    j5 = int(np.argmin(np.abs(DH_FOR_2D - 5)))
    if 0 < i95 < len(COS_FOR_2D) - 1 and 0 < j5 < len(DH_FOR_2D) - 1:
        dcos = (surface[i95 + 1, j5] - surface[i95 - 1, j5]) / \
               (COS_FOR_2D[i95 + 1] - COS_FOR_2D[i95 - 1])
        ddh = (surface[i95, j5 + 1] - surface[i95, j5 - 1]) / \
              (DH_FOR_2D[j5 + 1] - DH_FOR_2D[j5 - 1])
        grad_mag = float(np.sqrt(dcos ** 2 + ddh ** 2))
        print(f'\n  At (cos=0.95, dh=5): rate={surface[i95, j5]:.4f}')
        print(f'  d(rate)/d(cos) ~ {dcos:.4f} (per unit cos)')
        print(f'  d(rate)/d(dh)  ~ {ddh:.4f} (per unit dh)')
        print(f'  gradient magnitude ~ {grad_mag:.4f}')
    else:
        dcos = ddh = grad_mag = None

    # ── Document-level 1D cos sweep ──
    print('\n[Document-level 1D cos sweep at dh<=5]')
    doc_rates_cos = [doc_alert_rate(k, 5) for k in COS_GRID]
    for i, k in enumerate(COS_GRID):
        if i % 4 == 0 or abs(k - 0.95) < 1e-6:
            print(f'  cos > {k:.3f}: doc-FAR (HC) = {doc_rates_cos[i]:.4f}')

    doc_cos_grad = plateau_gradient(COS_GRID, doc_rates_cos)
    print(f'\n  doc gradient summary: min={doc_cos_grad["min"]:.5f}, '
          f'median={doc_cos_grad["median"]:.5f}, '
          f'max={doc_cos_grad["max"]:.5f}')

    # ── Plateau detection summary ──
    print('\n[Plateau detection summary]')
    cos095_idx = int(np.argmin(np.abs(COS_GRID - 0.95)))
    dh5_idx = int(np.argmin(np.abs(DH_GRID - 5)))
    if 0 < cos095_idx < len(sig_rates_cos['pooled']) - 1:
        local_grad_cos = abs(
            sig_rates_cos['pooled'][cos095_idx + 1] -
            sig_rates_cos['pooled'][cos095_idx - 1]) / \
            (COS_GRID[cos095_idx + 1] - COS_GRID[cos095_idx - 1])
    else:
        local_grad_cos = None
    if 0 < dh5_idx < len(sig_rates_dh['pooled']) - 1:
        local_grad_dh = abs(
            sig_rates_dh['pooled'][dh5_idx + 1] -
            sig_rates_dh['pooled'][dh5_idx - 1]) / \
            (DH_GRID[dh5_idx + 1] - DH_GRID[dh5_idx - 1])
    else:
        local_grad_dh = None
    median_grad_cos = cos_pooled_grad['median']
    median_grad_dh = dh_pooled_grad['median']
    ratio_cos = (local_grad_cos / median_grad_cos
                 if median_grad_cos and median_grad_cos > 0 else None)
    ratio_dh = (local_grad_dh / median_grad_dh
                if median_grad_dh and median_grad_dh > 0 else None)
    print(f'  v3 inherited cos=0.95 local |grad|={local_grad_cos:.5f}, '
          f'median |grad|={median_grad_cos:.5f}, '
          f'ratio={ratio_cos:.2f}')
    print(f'  v3 inherited dh=5   local |grad|={local_grad_dh:.5f}, '
          f'median |grad|={median_grad_dh:.5f}, '
          f'ratio={ratio_dh:.2f}')
    if ratio_cos is not None and ratio_cos < 0.5:
        print('  -> cos=0.95 IS at a low-gradient region (plateau-like).')
    elif ratio_cos is not None and ratio_cos > 1.5:
        print('  -> cos=0.95 IS at a high-gradient region (steep slope).')
    else:
        print('  -> cos=0.95 is at a moderate-gradient region '
              '(no clear plateau or cliff).')
|
||||
if ratio_dh is not None and ratio_dh < 0.5:
|
||||
print(' -> dh=5 IS at a low-gradient region (plateau-like).')
|
||||
elif ratio_dh is not None and ratio_dh > 1.5:
|
||||
print(' -> dh=5 IS at a high-gradient region.')
|
||||
else:
|
||||
print(' -> dh=5 is at a moderate-gradient region.')
|
||||
|
||||
results['cos_sweep_at_dh_5'] = {
|
||||
'cos_grid': COS_GRID.tolist(),
|
||||
'sig_rates': {k: v for k, v in sig_rates_cos.items()},
|
||||
'pooled_gradient_summary': cos_pooled_grad,
|
||||
}
|
||||
results['dh_sweep_at_cos_0_95'] = {
|
||||
'dh_grid': DH_GRID.tolist(),
|
||||
'sig_rates': {k: v for k, v in sig_rates_dh.items()},
|
||||
'pooled_gradient_summary': dh_pooled_grad,
|
||||
}
|
||||
results['surface_2d'] = {
|
||||
'cos_axis': COS_FOR_2D.tolist(),
|
||||
'dh_axis': DH_FOR_2D.tolist(),
|
||||
'rates': surface.tolist(),
|
||||
'at_v3_threshold': {
|
||||
'cos_0.95_dh_5_rate': float(surface[i95, j5]),
|
||||
'd_rate_d_cos': dcos,
|
||||
'd_rate_d_dh': ddh,
|
||||
'gradient_magnitude': grad_mag,
|
||||
},
|
||||
}
|
||||
results['doc_level_cos_sweep_at_dh_5'] = {
|
||||
'cos_grid': COS_GRID.tolist(),
|
||||
'doc_rates': doc_rates_cos,
|
||||
'doc_gradient_summary': doc_cos_grad,
|
||||
}
|
||||
results['plateau_detection'] = {
|
||||
'v3_cos_0_95': {
|
||||
'local_gradient': local_grad_cos,
|
||||
'median_gradient': median_grad_cos,
|
||||
'ratio_local_to_median': ratio_cos,
|
||||
},
|
||||
'v3_dh_5': {
|
||||
'local_gradient': local_grad_dh,
|
||||
'median_gradient': median_grad_dh,
|
||||
'ratio_local_to_median': ratio_dh,
|
||||
},
|
||||
}
|
||||
json_path = OUT / 'alert_rate_results.json'
|
||||
json_path.write_text(json.dumps(results, indent=2, ensure_ascii=False),
|
||||
encoding='utf-8')
|
||||
print(f'\n[json] {json_path}')
|
||||
|
||||
md = [
|
||||
'# Alert-Rate Sensitivity / Threshold-Plateau Analysis '
|
||||
'(Script 46)',
|
||||
'', f'Generated: {results["meta"]["timestamp"]}',
|
||||
f'Big-4 signatures: {n_sigs:,}; documents: {n_docs:,}',
|
||||
'',
|
||||
('Alert-rate sensitivity to threshold perturbation. If the '
|
||||
'v3-inherited threshold cos>0.95 AND dh<=5 sits at a '
|
||||
'low-gradient region, that is weak evidence the threshold is '
|
||||
'a stable operating point. If the alert-rate surface is '
|
||||
'everywhere smooth without a plateau, the threshold is one '
|
||||
'point on a continuous specificity-recall tradeoff -- '
|
||||
'consistent with the no-natural-threshold finding from '
|
||||
'Scripts 39b-39e.'),
|
||||
'',
|
||||
'## Plateau detection at v3 inherited thresholds',
|
||||
'',
|
||||
'| Threshold | local |grad| | median |grad| | ratio | interpretation |',
|
||||
'|---|---|---|---|---|',
|
||||
f'| cos=0.95 | {local_grad_cos:.5f} | '
|
||||
f'{median_grad_cos:.5f} | {ratio_cos:.2f} | '
|
||||
f'{"plateau" if ratio_cos < 0.5 else ("cliff" if ratio_cos > 1.5 else "moderate")} |',
|
||||
f'| dh=5 | {local_grad_dh:.5f} | {median_grad_dh:.5f} | '
|
||||
f'{ratio_dh:.2f} | '
|
||||
f'{"plateau" if ratio_dh < 0.5 else ("cliff" if ratio_dh > 1.5 else "moderate")} |',
|
||||
'',
|
||||
'## 1D cos sweep at dh<=5 (per-signature alert rate)',
|
||||
'',
|
||||
'| cos > k | pooled | Firm A | Firm B | Firm C | Firm D |',
|
||||
'|---|---|---|---|---|---|',
|
||||
]
|
||||
for i, k in enumerate(COS_GRID):
|
||||
if i % 2 == 0:
|
||||
md.append(f'| {k:.3f} | {sig_rates_cos["pooled"][i]:.4f} | '
|
||||
f'{sig_rates_cos["Firm A"][i]:.4f} | '
|
||||
f'{sig_rates_cos["Firm B"][i]:.4f} | '
|
||||
f'{sig_rates_cos["Firm C"][i]:.4f} | '
|
||||
f'{sig_rates_cos["Firm D"][i]:.4f} |')
|
||||
md += ['',
|
||||
'## 1D dh sweep at cos>0.95 (per-signature alert rate)',
|
||||
'',
|
||||
'| dh <= k | pooled | Firm A | Firm B | Firm C | Firm D |',
|
||||
'|---|---|---|---|---|---|']
|
||||
for i, k in enumerate(DH_GRID):
|
||||
md.append(f'| {int(k):2d} | {sig_rates_dh["pooled"][i]:.4f} | '
|
||||
f'{sig_rates_dh["Firm A"][i]:.4f} | '
|
||||
f'{sig_rates_dh["Firm B"][i]:.4f} | '
|
||||
f'{sig_rates_dh["Firm C"][i]:.4f} | '
|
||||
f'{sig_rates_dh["Firm D"][i]:.4f} |')
|
||||
md += ['',
|
||||
'## Document-level cos sweep at dh<=5',
|
||||
'',
|
||||
'| cos > k | doc alert rate (HC) |',
|
||||
'|---|---|']
|
||||
for i, k in enumerate(COS_GRID):
|
||||
if i % 2 == 0:
|
||||
md.append(f'| {k:.3f} | {doc_rates_cos[i]:.4f} |')
|
||||
md.append('')
|
||||
md_path = OUT / 'alert_rate_report.md'
|
||||
md_path.write_text('\n'.join(md), encoding='utf-8')
|
||||
print(f'[md ] {md_path}')
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()