Kolmogorov-Smirnov two-sample test


I want to test whether two samples are drawn from the same distribution. I generated two random arrays and used a Python function to compute the KS statistic $D$ and the two-tailed p-value $P$:

>>> import numpy as np
>>> from scipy import stats
>>> a=np.random.random_integers(1,9,4)
>>> a
array([3, 7, 4, 3])
>>> b=np.random.random_integers(1,9,5)
>>> b
array([2, 2, 3, 7, 9])
>>> stats.ks_2samp(a,b)
(0.40000000000000002, 0.75428850089034016)

From the documentation (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ks_2samp.html) I know that

$$D = 0.40000000000000002 \qquad\text{and}\qquad P = 0.75428850089034016.$$ So the p-value is about $75\%$: assuming the two samples were drawn from the same distribution, the probability of observing a $D$ at least this large is about $75\%$. (Note that $P$ is not the probability that the samples share a distribution.)

Now my question is what does $D$ tell me? And is there a simple way to calculate these two values by hand?

The Wikipedia article does not have a simple worked example with two samples, which is why I am asking here.


BEST ANSWER

One rejects the null hypothesis when the P-value is small. A common criterion is to reject if the P-value is less than 0.05.
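The P-value in the question can also be reproduced by hand. To my recollection, scipy 0.14's `ks_2samp` plugged $D$ into the asymptotic Kolmogorov distribution with the small-sample correction of Stephens (the constants 0.12 and 0.11 below come from that implementation, so treat them as an assumption about this particular scipy version). A sketch:

```python
import numpy as np

def ks_pvalue(d, n, m, terms=100):
    # Asymptotic two-sided K-S p-value with the small-sample correction
    # used in older scipy versions (Stephens):
    #   lambda = (en + 0.12 + 0.11/en) * D,  en = sqrt(n*m/(n+m)),
    #   P = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 * k^2 * lambda^2)
    en = np.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    k = np.arange(1, terms + 1)
    return 2 * np.sum((-1) ** (k - 1) * np.exp(-2 * (k * lam) ** 2))

# The question's samples: D = 0.4 with sizes n = 4, m = 5
print(round(ks_pvalue(0.4, 4, 5), 4))  # ~0.7543
```

This matches the $P \approx 0.754$ reported by `stats.ks_2samp` in the question to several decimal places.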

In a Kolmogorov-Smirnov test, the D-statistic is the maximum vertical distance between the empirical cumulative distribution functions (ECDFs) of the two samples. (Both ECDFs take values between 0 and 1, so $D$ always lies in $[0, 1]$.)

An ECDF is made by sorting the data and plotting it along the horizontal axis. Then the ECDF is a non-decreasing stair-step function that rises by 1/n at each of the n sorted data points. An ECDF is intended to approximate the cumulative distribution function (CDF) of the probability distribution from which the data were randomly sampled.
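For the arrays in the question, $D$ can therefore be computed by hand: evaluate both ECDFs at every observed value and take the largest absolute difference. Since each ECDF is a step function, it can only change at an observed data point, so checking the union of the two samples suffices. A minimal sketch:

```python
import numpy as np

def ecdf(sample, x):
    # Fraction of sample values that are <= x
    return np.mean(sample <= x)

a = np.array([3, 7, 4, 3])
b = np.array([2, 2, 3, 7, 9])

# D can only change at an observed data point, so it is enough to
# evaluate both ECDFs at the union of the two samples.
points = np.union1d(a, b)
D = max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)
print(D)  # 0.4
```

The maximum gap occurs at $x = 2$, where the ECDF of `b` has already risen to $2/5 = 0.4$ while the ECDF of `a` is still $0$, giving $D = 0.4$ as in the question.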

It is often difficult to distinguish between two distributions with small amounts of data, so the test is more revealing if you generate your simulated data with larger sample sizes.

Below is a session in R, in which x and y come from the same distribution and z comes from a different distribution. I show K-S tests to compare x and y and to compare x and z.

 x = rnorm(100, 50, 2);  y = rnorm(100, 50, 2);  z = rnorm(100, 65, 3)
 ks.test(x,y)

 #        Two-sample Kolmogorov-Smirnov test

 # data:  x and y 
 # D = 0.11, p-value = 0.5806  # Huge P-value, don't reject
 # alternative hypothesis: two.sided 

 ks.test(x,z)

 #        Two-sample Kolmogorov-Smirnov test

 # data:  x and z 
 # D = 1, p-value < 2.2e-16  # tiny P-value, so reject
 # alternative hypothesis: two.sided
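For comparison, the same experiment can be run in Python with scipy (the variable names and seed below are my own choices, not from the R session):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(50, 2, 100)  # same distribution as y
y = rng.normal(50, 2, 100)
z = rng.normal(65, 3, 100)  # clearly different distribution

d_xy, p_xy = stats.ks_2samp(x, y)
d_xz, p_xz = stats.ks_2samp(x, z)
print(d_xy, p_xy)  # expect small D, large p-value: don't reject
print(d_xz, p_xz)  # expect D near 1, tiny p-value: reject
```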