Question related to the Kolmogorov-Smirnov statistic


Question related to Kolmogorov Smirnov statistics:

If $F_n$ is the empirical distribution function for $n$ IID random variables with an unknown distribution function $F$, what does the random function $F_n$ look like?

What $y$-values can it take, and are there particular $x$-values in its domain that are relevant?

Below is what I know:

[Image]

But after this, how should I proceed to get the desired result?

Can someone help me with this?


There are 2 best solutions below


The $y$-values it can take lie in $[0, 1]$; more precisely, $F_n(x)$ is always a multiple of $1/n$, so the possible values are $0, 1/n, 2/n, \dots, 1$.

Another way of writing $F_n$ is $$F_n(x) = \frac{|\{X_i \mid X_i \leq x\}|}{n},$$ where $|\{X_i \mid X_i \leq x\}|$ is the number of data points less than or equal to $x$, and $n$ is the number of data points.

The function is a step function, with $F_n(x) = 0$ whenever $x < \min(X_i)$, $F_n(x) = 1$ whenever $x \geq \max(X_i)$, and $F_n(x) \in (0, 1)$ otherwise. For example, suppose that $n = 4$ and $X_1 = 0.55$, $X_2 = 0.1$, $X_3 = 0.8$, $X_4 = 0.25$; then $F_n$ is shown below:

[Figure: Empirical CDF example]
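
For what it's worth, here is a small R sketch that evaluates $F_n$ for these four points, once directly from the counting formula and once with base R's `ecdf`. The helper name `Fn_manual` and the evaluation grid `t_grid` are just illustrative choices, not anything from the question.

```r
# The four data points from the example above
x <- c(0.55, 0.1, 0.8, 0.25)
n <- length(x)

# F_n(t) = (number of X_i <= t) / n, written out from the definition
Fn_manual <- function(t) sapply(t, function(s) sum(x <= s) / n)

# Base R's ecdf() builds the same step function
Fn <- ecdf(x)

# Evaluate both at a few arbitrarily chosen points
t_grid <- c(0, 0.1, 0.2, 0.25, 0.6, 0.8, 1)
Fn_manual(t_grid)   # 0.00 0.25 0.25 0.50 0.75 1.00 1.00
Fn(t_grid)          # same values
```

Note that the value at each data point already includes the jump, so $F_n(0.8) = 1$ here.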

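To calculate $D_n = \sup_x |F_n(x) - F(x)|$ for a given $F_n$ and $F$, it suffices to evaluate $|F_n(x) - F(x)|$ at the points where $F_n$ or $F$ is discontinuous, checking both the value at the jump and the limit from the left, and then take the largest of these.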
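
As a minimal sketch of that recipe in R, take the four data points from the example and, purely as an assumption for illustration, let $F$ be the CDF of $\mathsf{Unif}(0,1)$ (the question does not specify $F$). Since $F_n$ jumps at each data point, the gap is checked both at the jump and just to its left. The variable names are my own.

```r
x  <- c(0.55, 0.1, 0.8, 0.25)
n  <- length(x)
xs <- sort(x)

# Assumed hypothesized CDF, evaluated at the sorted data: Unif(0,1)
F0 <- punif(xs)

# Gap at each jump, where F_n equals i/n
d_at_jump <- abs((1:n) / n - F0)
# Gap just to the left of each jump, where F_n equals (i-1)/n
d_left <- abs((0:(n - 1)) / n - F0)

D_n <- max(d_at_jump, d_left)
D_n                                    # 0.25

# Cross-check against R's built-in one-sample K-S test
unname(ks.test(x, "punif")$statistic)  # 0.25
```

Here the largest gap occurs at $x = 0.25$, where $F_n$ jumps to $0.5$ while $F(0.25) = 0.25$.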


It seems you are asking about a one-sample K-S test to see if a sample $X_i, i = 1,2,\dots,n$ is from a continuous distribution with CDF $F.$

At left is a plot (in R) of the ECDF of an exponential sample of size $n = 40$ along with the CDF of the sampled population. At right is a comparison with the CDF of a normal population with the same mean and SD.

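set.seed(1234)
x <- rexp(40)            # sample of size 40 from Exp(rate = 1)
par(mfrow = c(1, 2))     # two panels side by side
 hdr <- "ECDF with CDF of EXP(rate=1)"
 plot(ecdf(x), main = hdr)
  curve(pexp(x), add = TRUE, col = "blue", lwd = 2)
 hdr <- "ECDF with CDF of NORM(1,1)"
 plot(ecdf(x), main = hdr)
  curve(pnorm(x, 1, 1), add = TRUE, col = "red", lwd = 2)
par(mfrow = c(1, 1))     # restore the single-panel layout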

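[Figure: two panels showing the ECDF of the sample, with the CDF of EXP(rate=1) in blue (left) and the CDF of NORM(1,1) in red (right)]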

Here are the corresponding K-S tests. The K-S test statistic is the maximum vertical distance between the ECDF and the CDF.

  • The K-S test correctly fails to reject the null hypothesis that the data are from $\mathsf{Exp}(1).$
  • The K-S test rejects the null hypothesis that the data are from $\mathsf{Norm}(1,1)$ at the 10% level, but not at the 5% level.

For relatively small sample sizes the K-S test does not have good power.

ks.test(x, pexp, 1)

        One-sample Kolmogorov-Smirnov test

data:  x
D = 0.16877, p-value = 0.1825
alternative hypothesis: two-sided

ks.test(x, pnorm, 1, 1)

        One-sample Kolmogorov-Smirnov test

data:  x
D = 0.19875, p-value = 0.07346
alternative hypothesis: two-sided

By contrast, a Shapiro-Wilk test of normality rejects the hypothesis that sample x came from any normal distribution, with a P-value very near $0.$

shapiro.test(x)

        Shapiro-Wilk normality test

data:  x
W = 0.84958, p-value = 8.789e-05

In a simulation with 100,000 samples of size 40 from $\mathsf{Exp}(1),$ the K-S test rejected the null hypothesis that the sampled population is $\mathsf{Norm}(1,1)$ only about 36% of the time (at the 5% level).

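set.seed(2020)
# P-values of K-S tests against Norm(1,1) for 10^5 exponential samples
pv <- replicate(10^5, ks.test(rexp(40), pnorm, 1, 1)$p.value)
mean(pv <= 0.05)    # proportion of rejections at the 5% level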
[1] 0.36385
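
For comparison, the same kind of simulation can be sketched for the Shapiro-Wilk test. This is only a rough companion to the run above: the number of replications `B` and the seed are arbitrary choices of mine, and the two tests are not answering quite the same question, since Shapiro-Wilk tests against any normal distribution while the K-S test here is against $\mathsf{Norm}(1,1)$ specifically.

```r
set.seed(2020)   # arbitrary seed, only for reproducibility
B <- 2000        # fewer replications than above, to keep the run short

# P-values of Shapiro-Wilk tests for B exponential samples of size 40
pv_sw <- replicate(B, shapiro.test(rexp(40))$p.value)

# Estimated rejection rate at the 5% level
sw_power <- mean(pv_sw <= 0.05)
sw_power
```

Comparing `sw_power` with the K-S rejection rate gives a feel for how much power the K-S test gives up at this sample size.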

It isn't hard to tell that a sample from $\mathsf{Exp}(1)$ is not from $\mathsf{Norm}(1,1).$ The exponential distribution gives only positive values. From the normal distribution we should see about 16% negative values. To get positive values all 40 times would be very rare indeed.

pnorm(0, 1, 1)
[1] 0.1586553
(1-pnorm(0, 1, 1))^40
[1] 0.000997607