Check for distribution of the sample with unknown parameters using ks.test in R.

When I run ks.test in R on a sample to check which distribution it comes from, it gives me a $p$-value less than 0.01 for various distributions, and I don't know why. Maybe because of the parameters or something? Also, I have a dataset in R with two columns (samples), and ks.test even gives an output for the whole dataset (when I write ks.test(x = data, ...)). Anyway, I don't know how to correct the issue so that the test really shows which distribution the data are drawn from. For almost every distribution the $p$-value is much less than 0.01.

There is 1 best solution below

I have no idea what data you are using or how you are choosing their supposed population distributions. Maybe a demonstration where the Kolmogorov-Smirnov test in R does work will be helpful.

Suppose I use R to take a sample of size $n = 100$ from a population known to be $\mathsf{Norm}(\mu = 100, \sigma = 15).$ Then I use the K-S test to see whether the data match their parent distribution. The K-S test does not reject the null hypothesis that the data are distributed as $\mathsf{Norm}(\mu = 100, \sigma = 15).$

set.seed(527)
x = rnorm(100, 100, 15)
ks.test(x, "pnorm", 100, 15)

        One-sample Kolmogorov-Smirnov test

data:  x
D = 0.10263, p-value = 0.2428
alternative hypothesis: two-sided

Now suppose I generate a random sample of size 200 from $\mathsf{Exp}(\lambda = 0.01),$ which has mean 100 and standard deviation 100. Then I do a K-S test to see if the data are consistent with a normal distribution with $\mu = \sigma = 100.$

set.seed(2020)
y = rexp(200, 0.01)
ks.test(y, "pnorm", 100, 100)

        One-sample Kolmogorov-Smirnov test

data:  y
D = 0.15885, p-value = 8.266e-05
alternative hypothesis: two-sided

The null hypothesis that the data are from $\mathsf{Norm}(100,100)$ is rejected with a tiny P-value. The mean and standard deviation are correct, but the K-S test detects that the shape of the distribution is wrong. However, the null hypothesis that the data are from $\mathsf{Exp}(0.01)$ is not rejected:

ks.test(y, "pexp", .01)$p.val
[1] 0.3032855

The K-S test works by comparing the empirical CDF (ECDF) of the sample with the CDF of the hypothetical distribution. An ECDF of continuous data is a step-function that increases by $1/n$ at each (sorted) observation.
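The $1/n$ step property can be checked directly. The following is only a small sketch: it regenerates the exponential sample with the same seed as above, and the names `Fn`, `ys`, and `n` are introduced here purely for illustration.

```r
# Recreate the exponential sample used above (same seed and parameters).
set.seed(2020)
y <- rexp(200, 0.01)
n <- length(y)

Fn <- ecdf(y)   # ecdf() returns the ECDF as a step function
ys <- sort(y)   # sorted observations

# At the i-th sorted observation the ECDF equals i/n
# (assuming no ties, as expected for continuous data).
all.equal(Fn(ys), (1:n) / n)

# Every jump between consecutive sorted observations has size 1/n.
range(diff(Fn(ys)))
1 / n
```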

In the plot below, the heavy dots show the ECDF of the sample. The heavy blue curve is the CDF of $\mathsf{Exp}(0.01),$ which matches the ECDF pretty well. But the CDF of $\mathsf{Norm}(100,100)$ (broken red curve) is a poor match.

plot(ecdf(y))                                          # ECDF of the sample
  curve(pexp(x, .01), add=TRUE, col="blue", lwd=2)     # CDF of Exp(0.01)
  curve(pnorm(x, 100, 100), add=TRUE, col="red", lwd=2, lty="dashed")  # CDF of Norm(100,100)

[Plot: ECDF of the sample y, with the CDF of Exp(0.01) (blue curve) and the CDF of Norm(100,100) (broken red curve)]

The $D$-statistic of the K-S test is the largest vertical distance between the sample ECDF and the hypothetical CDF.
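As a rough check of this definition, $D$ can be computed by hand and compared with the value reported by `ks.test`. This is a sketch for the one-sample, two-sided test against $\mathsf{Exp}(0.01)$ only; the names `Fy`, `D_manual`, and `D_ks` are made up for the illustration. Because the ECDF jumps at each observation, the vertical gap has to be examined on both sides of every jump.

```r
# Same exponential sample as above.
set.seed(2020)
y <- rexp(200, 0.01)
n <- length(y)

# Hypothesized CDF evaluated at the sorted observations.
Fy <- pexp(sort(y), 0.01)

# Largest vertical gap between ECDF and CDF:
#   just after the i-th jump the ECDF is i/n,
#   just before it the ECDF is (i - 1)/n.
D_manual <- max((1:n) / n - Fy, Fy - (0:(n - 1)) / n)

# D as reported by ks.test (drop the "D" name for comparison).
D_ks <- unname(ks.test(y, "pexp", 0.01)$statistic)

all.equal(D_manual, D_ks)
```

Repeating the calculation with `pnorm(sort(y), 100, 100)` in place of `pexp(sort(y), 0.01)` should give the larger $D$ reported for the normal null hypothesis.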