Two-Sample Kolmogorov-Smirnov Test


I am trying to understand the two-sample Kolmogorov-Smirnov test. Nowhere can I find good examples that connect the math to a real example, especially one involving two different distributions. Does someone know where to find such an example, or can someone give me one?

I have added an example to my question and would like to check whether my understanding is correct. How do I calculate the p-value?

ID  Sample X    Sample Y    Cum F(X)    Cum F(Y)    Diff    
1   4           1           0.026490066 0.008196721 0.018293345 
2   28          18          0.21192053  0.155737705 0.056182825 
3   24          25          0.370860927 0.360655738 0.010205189 
4   21          5           0.509933775 0.401639344 0.108294431 
5   23          13          0.662251656 0.508196721 0.154054934 
6   12          7           0.741721854 0.56557377  0.176148084 
7   7           20          0.78807947  0.729508197 0.058571273 
8   23          13          0.940397351 0.836065574 0.104331777 
9   9           20          1           1           0   
Sum     151     122     D-stat          0.176148084
Count   9       9       D-crit          0.64021448
                        Significance    No

Significance = No:  H_0, the samples come from P
Significance = Yes: H_1, the samples do not come from P

To explain in math I did the following:

I have two samples (X and Y) and I would like to test if their distributions are the same.

  • $X$ = Sample X
  • $Y$ = Sample Y
  • $F(X_i) = \frac{X_i}{N}$; observed cumulative frequency distribution of a random sample of n observations; (no. of observations ≤ X)/(sum of observations)
  • $F(Y_i) = \frac{Y_i}{N}$; observed cumulative frequency distribution of a random sample of n observations; (no. of observations ≤ Y)/(sum of observations)
  • $n_X = \sum_{i=1}^{n}{X_i}$; $n_Y = \sum_{i=1}^{n}{Y_i}$
  • $D\text{-stat} = \max |F(X) - F(Y)|$
  • $D\text{-crit} = c(\alpha)\sqrt{\frac{n_X+n_Y}{n_X n_Y}}$
  • Hypothesis check: if D-stat > D-crit, $H_0$ is rejected
  • 5% significance level (95% confidence), $\alpha = 0.05$, $c(\alpha) = 1.3581$
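For reference, the D-crit value in the table can be reproduced with a few lines of Python. This is only a sketch: it takes $n_X = n_Y = 9$ from the Count row, and it assumes the usual large-sample approximation $c(\alpha) = \sqrt{-\ln(\alpha/2)/2}$, which is not stated above but gives the quoted 1.3581 at $\alpha = 0.05$. The function names `c_alpha` and `d_crit` are made up for this illustration.

```python
import math

def c_alpha(alpha):
    # Assumed large-sample approximation for the coefficient c(alpha).
    return math.sqrt(-math.log(alpha / 2) / 2)

def d_crit(alpha, n_x, n_y):
    # Critical value: c(alpha) * sqrt((n_X + n_Y) / (n_X * n_Y)),
    # where n_x and n_y are the sample sizes (the Count row of the table).
    return c_alpha(alpha) * math.sqrt((n_x + n_y) / (n_x * n_y))

print(round(c_alpha(0.05), 4))       # 1.3581
print(round(d_crit(0.05, 9, 9), 4))  # 0.6402
```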

There is 1 best solution below


This process is also described on the English Wikipedia.

Construct CDFs:

  • Sort. \begin{align*} X&: (4, 7, 9, 12, 21, 23, 23, 24, 28) \\ Y&: (1, 5, 7, 13, 13, 18, 20, 20, 25) \end{align*}
  • Construct CDFs. These should be your Cum F(...)s: \begin{align*} CDF(X) &= \begin{cases} 0 & \phantom{4\leq{}} x <4 \\ \frac{1}{9} & 4\leq x<7 \\ \frac{2}{9} & 7\leq x<9 \\ \frac{1}{3} & 9\leq x<12 \\ \frac{4}{9} & 12\leq x<21 \\ \frac{5}{9} & 21\leq x<23 \\ \frac{7}{9} & 23\leq x<24 \\ \frac{8}{9} & 24\leq x<28 \\ 1 & 28 \leq x \end{cases} \\ CDF(Y) &= \begin{cases} 0 & \phantom{1\leq{}}x < 1 \\ \frac{1}{9} & 1\leq x<5 \\ \frac{2}{9} & 5\leq x<7 \\ \frac{1}{3} & 7\leq x<13 \\ \frac{5}{9} & 13\leq x<18 \\ \frac{2}{3} & 18\leq x<20 \\ \frac{8}{9} & 20\leq x<25 \\ 1 & 25 \leq x \end{cases} \end{align*} Let's plot these. [Figure: CDFs plotted on the same axes]
  • Now we compute $|\mathrm{CDF}(X) - \mathrm{CDF}(Y)|$, marking the global maximum. $$ |\mathrm{CDF}(X) - \mathrm{CDF}(Y)| = \begin{cases} 0 & \phantom{1\leq{}}x < 1 \\ \frac{1}{9} & 1\leq x<4 \\ 0 & 4\leq x<5 \\ \frac{1}{9} & 5\leq x<9 \\ 0 & 9\leq x<12 \\ \frac{1}{9} & 12\leq x<18 \\ \frac{2}{9} & 18\leq x<20 \\ \frac{4}{9} \ast & 20\leq x<21 \\ \frac{1}{3} & 21\leq x<23 \\ \frac{1}{9} & 23\leq x<24 \\ 0 & 24\leq x<25 \\ \frac{1}{9} & 25\leq x<28 \\ 0 & 28\leq x \end{cases} $$
  • So your test statistic is $4/9 = 0.\overline{4}$. As you have calculated, the critical value at the $\alpha = 0.05$ level is $0.64021{\dots}$. Since the test statistic is less than the critical value, the null hypothesis (that the two samples are drawn from the same distribution) is not rejected.
  • $p$-values are typically either provided by software or found in tables. For example, KolmogorovSmirnovTest[] in Mathematica 11.3 finds the $p$-value for this test statistic for samples of sizes $(9,9)$ is $0.27396{\dots}$. The R {stats} package implements the test and $p$-value computation in ks.test. Python's SciPy implements these calculations as scipy.stats.ks_2samp(). There is even an Excel implementation called KS2TEST.
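The steps above (sort, build the empirical CDFs, take the largest absolute difference) can be sketched in plain Python. This is a minimal illustration using only the standard library, not how any of the packages mentioned above implement the test; the helper names `ecdf` and `ks_statistic` are invented here. It uses exact fractions so the result can be compared directly with the $4/9$ found by hand.

```python
from bisect import bisect_right
from fractions import Fraction

def ecdf(sample):
    """Return a function x -> fraction of observations in `sample` that are <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: Fraction(bisect_right(ordered, x), n)

def ks_statistic(xs, ys):
    """Return (D, x): the largest |F_X(x) - F_Y(x)| and the first point attaining it."""
    f_x, f_y = ecdf(xs), ecdf(ys)
    # Both empirical CDFs are step functions that only jump at observed
    # values, so checking the pooled data points is sufficient.
    points = sorted(set(xs) | set(ys))
    return max(((abs(f_x(p) - f_y(p)), p) for p in points), key=lambda t: t[0])

# Data from the table in the question.
sample_x = [4, 28, 24, 21, 23, 12, 7, 23, 9]
sample_y = [1, 18, 25, 5, 13, 7, 20, 13, 20]

d_stat, location = ks_statistic(sample_x, sample_y)
print(d_stat, location)  # 4/9 20
```

The maximum is attained at $x = 20$, matching the starred interval $20 \leq x < 21$ above, and $4/9 \approx 0.444$ is below the critical value $0.64021{\dots}$.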