I am trying to understand the two-sample Kolmogorov-Smirnov test. I cannot find good examples anywhere that connect the math to a real example, especially one with two different distributions. Does someone know where to find such an example, or can someone give me one?
I have added an example to my question. I would like to check whether my understanding is correct, and how do I calculate the p-value now?
| ID | Sample X | Sample Y | Cum F(X) | Cum F(Y) | Diff |
|---|---|---|---|---|---|
| 1 | 4 | 1 | 0.026490066 | 0.008196721 | 0.018293345 |
| 2 | 28 | 18 | 0.21192053 | 0.155737705 | 0.056182825 |
| 3 | 24 | 25 | 0.370860927 | 0.360655738 | 0.010205189 |
| 4 | 21 | 5 | 0.509933775 | 0.401639344 | 0.108294431 |
| 5 | 23 | 13 | 0.662251656 | 0.508196721 | 0.154054934 |
| 6 | 12 | 7 | 0.741721854 | 0.56557377 | 0.176148084 |
| 7 | 7 | 20 | 0.78807947 | 0.729508197 | 0.058571273 |
| 8 | 23 | 13 | 0.940397351 | 0.836065574 | 0.104331777 |
| 9 | 9 | 20 | 1 | 1 | 0 |
| Sum | 151 | 122 | | D-stat | 0.176148084 |
| Count | 9 | 9 | | D-crit | 0.64021448 |
| | | | | Significance | No |
No: $H_0$, the samples come from the same distribution $P$.
Yes: $H_1$, the samples do not come from the same distribution $P$.
In mathematical terms, I did the following:
I have two samples (X and Y) and I would like to test if their distributions are the same.
- $X$ = Sample X
- $Y$ = Sample Y
- $F(X_i) = \frac{X_i}{N}$; observed cumulative frequency distribution of a random sample of $n$ observations; (no. of observations ≤ X)/(sum of observations)
- $F(Y_i) = \frac{Y_i}{N}$; observed cumulative frequency distribution of a random sample of $n$ observations; (no. of observations ≤ Y)/(sum of observations)
- $n_X = \sum_{i=1}^{n}{X_i}$; $n_Y = \sum_{i=1}^{n}{Y_i}$
- $D_{\text{stat}} = \max_i |F(X_i) - F(Y_i)|$
- $D_{\text{crit}} = c(\alpha)\sqrt{\frac{n_X+n_Y}{n_X n_Y}}$
- Hypothesis check: if $D_{\text{stat}} > D_{\text{crit}}$, $H_0$ is rejected
- 95% confidence level, $\alpha = 0.05$, $c(\alpha) = 1.3581$
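The critical-value formula in the list above can be checked numerically. The sketch below is only an illustration: the function name `ks_critical_value` is made up here, and it assumes that $n_X$ and $n_Y$ are the sample sizes (9 observations each), since that is the reading that reproduces the D-crit value in the table.

```python
import math


def ks_critical_value(n_x: int, n_y: int, c_alpha: float = 1.3581) -> float:
    """Large-sample critical value c(alpha) * sqrt((n_x + n_y) / (n_x * n_y)).

    Assumption: n_x and n_y are the numbers of observations in each sample,
    and c_alpha = 1.3581 is the coefficient quoted for alpha = 0.05.
    """
    return c_alpha * math.sqrt((n_x + n_y) / (n_x * n_y))


# Both samples in the table have 9 observations.
d_crit = ks_critical_value(9, 9)
print(f"D-crit = {d_crit:.4f}")
```

With both sample sizes set to 9 this agrees with the 0.64021448 shown in the table.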
This process is also described on the English Wikipedia.
Construct CDFs:
Cum F(...)s:

\begin{align*}
CDF(X) &= \begin{cases}
0 & \phantom{4\leq{}} x < 4 \\
\frac{1}{9} & 4 \leq x < 7 \\
\frac{2}{9} & 7 \leq x < 9 \\
\frac{1}{3} & 9 \leq x < 12 \\
\frac{4}{9} & 12 \leq x < 21 \\
\frac{5}{9} & 21 \leq x < 23 \\
\frac{7}{9} & 23 \leq x < 24 \\
\frac{8}{9} & 24 \leq x < 28 \\
1 & 28 \leq x
\end{cases} \\
CDF(Y) &= \begin{cases}
0 & \phantom{1\leq{}} x < 1 \\
\frac{1}{9} & 1 \leq x < 5 \\
\frac{2}{9} & 5 \leq x < 7 \\
\frac{1}{3} & 7 \leq x < 13 \\
\frac{5}{9} & 13 \leq x < 18 \\
\frac{2}{3} & 18 \leq x < 20 \\
\frac{8}{9} & 20 \leq x < 25 \\
1 & 25 \leq x
\end{cases}
\end{align*}

Let's plot these.

`KolmogorovSmirnovTest[]` in Mathematica 11.3 finds that the $p$-value for this test statistic for samples of sizes $(9,9)$ is $0.27396{\dots}$. The R {stats} package implements the test and $p$-value computation in `ks.test`. Python's SciPy implements these calculations as `scipy.stats.ks_2samp()`. There is even an Excel implementation called KS2TEST.
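To make the step from the two empirical CDFs to the test statistic concrete, here is a minimal pure-Python sketch. The helper names `ecdf` and `ks_statistic` are made up for illustration and do not come from any of the libraries mentioned. It builds each step function as the fraction of observations ≤ t and takes the largest absolute gap over all observed values, which suffices because both functions only change at data points.

```python
from bisect import bisect_right
from fractions import Fraction

# The two samples from the question's table.
x = [4, 28, 24, 21, 23, 12, 7, 23, 9]
y = [1, 18, 25, 5, 13, 7, 20, 13, 20]


def ecdf(sample):
    """Return a function t -> fraction of observations in `sample` that are <= t."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda t: Fraction(bisect_right(ordered, t), n)


def ks_statistic(a, b):
    """Largest absolute difference between the two empirical CDFs."""
    f_a, f_b = ecdf(a), ecdf(b)
    # Both ECDFs are right-continuous step functions that jump only at data
    # points, so checking every observed value is enough to find the maximum.
    return max(abs(f_a(t) - f_b(t)) for t in set(a) | set(b))


d_stat = ks_statistic(x, y)
print(d_stat, float(d_stat))
```

For these two samples the largest gap is 4/9, reached at x = 20, where CDF(X) is 4/9 and CDF(Y) is 8/9. Passing the same two lists to `scipy.stats.ks_2samp(x, y)` should report the same statistic along with a p-value.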