Paper A v3.18.1: address remaining partner red-pen prose clarity items

Three targeted fixes per partner's red-pen audit (residue from v3.18 cleanup):

1. III-D 92.6% match rate -- partner red-circled the bare figure ("don't
   quite understand; improve this line").
   Add explicit explanation of the unmatched 7.4% (13,573 signatures): they
   could not be matched to a registered CPA name (deviation from two-signature
   layout, OCR-name mismatch) and are excluded from same-CPA pairwise analyses
   for definitional reasons, not discarded as noise.

2. III-I.1 Hartigan dip-test wording -- partner wrote "? so why?" next to the
   "rejecting unimodality is consistent with but does not directly establish
   bimodality" sentence. Replace with a direct three-line explanation: the
   test asks "is the distribution single-peaked?", a non-significant p means
   we cannot reject single-peak, a significant p means more than one peak
   (could be 2/3/...). Removes the partner's confusion without losing rigor.

3. IV-G validation lead-in -- partner wrote "don't quite understand why this
   is stated?" on the
   tangled "consistency check / threshold-free / operational classifier"
   triple. Rewrite as a three-bullet structure that names the *informative
   quantity* in each subsection (temporal trend / concentration ratio /
   cross-firm gap) and states explicitly why each is robust to cutoff choice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:48:59 +08:00
parent 16e90bab20
commit cb77f481ec
3 changed files with 14 additions and 6 deletions
Binary file not shown.
+6 -1
@@ -74,6 +74,7 @@ Batch inference on all 86,071 documents extracted 182,328 signature images at a
A red stamp removal step was applied to each cropped signature using HSV color-space filtering, replacing detected red regions with white pixels to isolate the handwritten content.
Each signature was matched to its corresponding CPA using positional order (first or second signature on the page) against the official CPA registry, achieving a 92.6% match rate (168,755 of 182,328 signatures).
+The remaining 7.4% (13,573 signatures) could not be matched to a registered CPA name---typically because the auditor's report page format deviates from the standard two-signature layout, or because OCR of the printed CPA name on the page returns a name not present in the registry---and these signatures are excluded from all subsequent same-CPA pairwise analyses (a same-CPA best-match statistic is undefined when a signature has no assigned CPA). The 92.6% matched subset is the sample that flows into Sections IV-D through IV-H; the unmatched 7.4% are excluded for definitional reasons rather than discarded as noise.
## E. Feature Extraction
@@ -188,7 +189,11 @@ Because all three diagnostics are applied to the same sample rather than to inde
We use two closely related KDE-based threshold estimators and apply each where it is appropriate.
When two labeled populations are available (e.g., the all-pairs intra-class and inter-class similarity distributions of Section IV-C), the *KDE crossover* is the intersection point of the two kernel density estimates under Scott's rule for bandwidth selection [28]; under equal priors and symmetric misclassification costs it approximates the Bayes-optimal decision boundary between the two classes.
When a single distribution is analysed (e.g., the per-signature best-match cosine distribution of Section IV-D) the *KDE antimode* is the local density minimum between two modes of the fitted density; it serves the same decision-theoretic role when the distribution is multimodal but is undefined when the distribution is unimodal.
-In either case we use the Hartigan & Hartigan dip test [37] as a formal test of unimodality (rejecting the null of unimodality is consistent with but does not directly establish bimodality specifically), and perform a sensitivity analysis varying the bandwidth over $\pm 50\%$ of the Scott's-rule value to verify threshold stability.
+In either case we use the Hartigan & Hartigan dip test [37] as a formal test of unimodality.
+The dip test asks one question: *is the distribution single-peaked?*
+A non-significant $p$-value means we cannot reject the single-peak null (the data are consistent with one peak); a significant $p$-value means the distribution has *more than one peak* (it could be two, three, or more---the test does not specify how many).
+We use the test to decide whether a KDE antimode is well-defined (it is, only when there is more than one peak), not to assert any particular number of components.
+We additionally perform a sensitivity analysis varying the bandwidth over $\pm 50\%$ of the Scott's-rule value to verify threshold stability.
### 2) Method 2: Finite Mixture Model via EM
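The antimode-plus-sensitivity procedure described in the hunk above can be sketched in a few lines. This is a minimal illustration using SciPy's `gaussian_kde` (whose default bandwidth is Scott's rule) on synthetic data, not the paper's implementation; the prominence filter and the toy cluster parameters are assumptions for the sketch, and the dip test itself would come from a separate package such as `diptest`.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

def kde_antimode(x, bw_scale=1.0, grid_size=512):
    """Local density minimum between the outermost modes of a KDE.

    Returns None when the fitted density has fewer than two modes,
    in which case the antimode is undefined (the unimodal case).
    """
    kde = gaussian_kde(x)                      # Scott's rule by default
    kde.set_bandwidth(kde.factor * bw_scale)   # perturb bandwidth for sensitivity
    grid = np.linspace(x.min(), x.max(), grid_size)
    dens = kde(grid)
    # keep only substantial modes; tiny tail wiggles are not "peaks"
    modes, _ = find_peaks(dens, prominence=0.05 * dens.max())
    if len(modes) < 2:
        return None
    lo, hi = modes[0], modes[-1]
    return grid[lo + np.argmin(dens[lo:hi + 1])]

# synthetic stand-in for a bimodal per-signature best-match cosine distribution
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.60, 0.05, 500),   # dissimilar cluster
                    rng.normal(0.95, 0.02, 500)])  # near-identical cluster

# sensitivity analysis: antimode under Scott's rule and +/-50% bandwidth
thresholds = {s: kde_antimode(x, s) for s in (0.5, 1.0, 1.5)}
```

On data this cleanly bimodal, all three bandwidths yield an antimode in the valley between the clusters; on unimodal data the function returns `None`, mirroring the text's point that the antimode is then undefined.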
+8 -5
@@ -232,11 +232,14 @@ The paper therefore retains cos $> 0.95$ as the primary operational cut for tran
## G. Additional Firm A Benchmark Validation
-The capture rates of Section IV-E are a within-sample consistency check: they evaluate how well a threshold captures Firm A, but the thresholds themselves are anchored to Firm A's percentiles.
+The capture rates of Section IV-E are an *internal* consistency check: they ask "how much of Firm A does our threshold capture?", but the threshold was itself derived from Firm A's percentiles, so a high capture rate is not surprising.
-This section reports three complementary analyses that go beyond the whole-sample capture rates.
+To go beyond this circular check, we report three further analyses, each chosen so that the *informative quantity* does not depend on the threshold's absolute value:
-Subsection H.2 is fully threshold-independent (it uses only ordinal ranking).
-Subsection H.1 uses a fixed 0.95 cutoff but derives information from the longitudinal stability of rates rather than from the absolute rate at any single year.
+- **§IV-G.1 (year-by-year stability).** Holds the cosine cutoff fixed at 0.95 and asks whether the share of Firm A below the cutoff is *stable across years*. The information is in the temporal trend, not in the absolute rate; under a noise-only explanation of the left tail, the share should shrink as scan/PDF technology matured.
-Subsection H.3 applies the calibrated classifier and is therefore a consistency check on the classifier's firm-level output rather than a threshold-free test; the informative quantity is the cross-firm *gap* rather than the absolute agreement rate at any one firm.
+- **§IV-G.2 (partner-level similarity ranking).** Uses *no threshold at all*: every auditor-year is ranked by mean similarity, and we measure Firm A's share of the top decile against its baseline share. The information is in the concentration ratio, which is invariant to the choice of cutoff.
+- **§IV-G.3 (intra-report agreement).** Applies the calibrated classifier and measures whether the *two co-signing CPAs on the same Firm A report* receive the same classifier label, then compares Firm A's intra-report agreement rate to the other firms'. The information is in the *cross-firm gap*; the absolute agreement rate at any one firm depends on the cutoff, but the gap is robust to moderate cutoff shifts as long as the same cutoff is applied uniformly across firms.
+Together these three analyses provide threshold-free or threshold-robust evidence that complements the within-sample capture rates of Section IV-E.
### 1) Year-by-Year Stability of the Firm A Left Tail
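The §IV-G.2 statistic above (rank by mean similarity, measure concentration in an extreme decile against the baseline share) can be sketched as follows. The data are hypothetical toy values, and treating the *lowest*-similarity decile as the extreme of interest is an assumption for illustration; only the ordinal ranking matters, which is exactly why the statistic is invariant to any cutoff choice.

```python
import numpy as np

def decile_concentration(scores, firms, target, decile=0.10):
    """Threshold-free concentration statistic: rank units by score
    (ascending, lowest similarity first), take the extreme decile,
    and divide the target firm's share in that decile by its
    baseline share in the full sample.  Because only the ordering
    of `scores` is used, the result does not depend on where any
    similarity cutoff is placed.
    """
    scores = np.asarray(scores)
    firms = np.asarray(firms)
    k = max(1, int(round(len(scores) * decile)))
    extreme = firms[np.argsort(scores)[:k]]      # the k lowest-scoring units
    share_in_decile = np.mean(extreme == target)
    baseline_share = np.mean(firms == target)
    return share_in_decile / baseline_share

# hypothetical toy data: firm "A" is 20% of units but fills the low tail
rng = np.random.default_rng(1)
firms = np.array(["A"] * 200 + ["B"] * 800)
scores = np.concatenate([0.5 * rng.beta(2, 2, 200),         # A: in (0, 0.5)
                         0.5 + 0.5 * rng.beta(2, 2, 800)])  # B: in (0.5, 1)

ratio = decile_concentration(scores, firms, "A")
```

A ratio of 1 means the target firm is represented in the extreme decile in proportion to its size; here the bottom decile is filled entirely by "A", so the ratio equals 1.0 / 0.2 = 5.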