Kolmogorov-Smirnov Test ($KS$-test)


Background Information:

Start with the sample $X_1,\ldots, X_{N}$ and sort it so that $X_1\leq X_2\leq \cdots \le X_N$. In our case the data set is $x_1 = 0.2$, $x_2 = 0.6$, $x_3 = 0.7$. Suppose $X\sim \mathcal{U}(0,1)$; then the cumulative distribution function of $X$ is $$F(x) = x \quad \text{for } 0\leq x\leq 1.$$ We have $$D_N = \sum_{-\infty < x < \infty}|F_N(x) - F(x)|$$ where $$F_N(x) = \begin{cases} 0 \ &\text{if } x < X_1\\ k/N \ &\text{if } X_k\leq x < X_{k+1}\\ 1 \ &\text{if } x \geq X_N \end{cases}$$ The first and last terms are $$\sup_{x < X_1}|0-F(x)| = F(X_1)$$ $$\sup_{x \geq X_N} |1 - F(x)| = 1 - F(X_N)$$ For the other terms, observe that $$\sup_{X_k\leq x < X_{k+1}}\left|\frac{k}{N} - F(x)\right| = \max\left(F(X_{k+1}) - \frac{k}{N},\frac{k}{N} - F(X_k)\right), \quad k = 1, \ldots, N - 1$$

$D^{+}$ and $D^{-}$: $$\begin{aligned}D_{N}&=\max\left[F(X_{1}),\max_{k=1,\ldots,N-1}\left(F(X_{k+1})-\tfrac{k}{N},\tfrac{k}{N}-F(X_{k})\right),1-F(X_{N})\right]\\ &= \max\left[\underbrace{\max_{k=1,\ldots,N}\left(\tfrac{k}{N}-F(X_{k})\right)}_{D^{+}},\underbrace{\max_{k=1,\ldots,N}\left(F(X_{k})-\tfrac{k-1}{N}\right)}_{D^{-}}\right]\\&=\max\{D^{+},D^{-}\}\end{aligned}$$ The formulas simplify if $X\sim\mathcal{U}(0,1)$ since then $F(x)=x$.

Question:

Compute $D_N = \max\{D^{+},D^{-}\}$ for the data set $x_1 = 0.2$, $x_2 = 0.6$, $x_3 = 0.7$. Take $F$ to be the c.d.f. of $\mathcal{U}(0,1)$, the uniform distribution on $(0,1)$. (Do these computations by hand, with no computer code.) What do you think $D^{+}$, $D^{-}$, $D_N$ measure, intuitively?

Attempted solution: In our case, with $x_1 = 0.2$, $$D_1 = F(x_1) = 0.2$$ and $$D_3 = F(x_3) = 0.7$$

I am not sure how to get $D_2$, or whether I am doing this correctly at all. I am also unsure about the intuition asked for in the last part of the question. Any suggestions are greatly appreciated.

Best Answer:

I am somewhat unsure about one detail in your question: in the standard KS test, the $D_{N}$ quantity is defined as $D_{N}=\sup_{x\in\mathbb{R}}|F_{N}(x)-F(x)|$. I think the expression you give is a typo in MathJax from "\sup" to "\sum". My answer below is for "\sup".

With this, your $F_{N}$ is the empirical cdf calculated from your data, and $D_{N}$ is the largest absolute difference between the empirical cdf and the theoretical cdf. (Your $D_{1}$ and $D_{3}$ make no sense to me: there are $3$ observations, so $N=3$ and the only such quantity is $D_{3}$.)

What is the empirical cdf? It is, substituting your data into your definition of $F_{N}$: $$F_{N}(x)=\begin{cases} 0&\text{ if }x<0.2\\ \frac{1}{3}&\text{ if }x\in[0.2,0.6)\\ \frac{2}{3}&\text{ if }x\in[0.6,0.7)\\ 1&\text{ if }x\geq0.7\\ \end{cases}$$
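Once you have written the step function down by hand, a few lines of Python can confirm it. This is only a sketch for checking your work: the helper name `empirical_cdf` is my own choice, and it relies on the standard-library `bisect_right` to count how many data points are at most $x$.

```python
from bisect import bisect_right


def empirical_cdf(sample):
    """Build F_N for a sample as a right-continuous step function."""
    xs = sorted(sample)
    n = len(xs)

    def F_N(x):
        # bisect_right returns the number of sample points <= x,
        # so F_N jumps by 1/n exactly at each data point.
        return bisect_right(xs, x) / n

    return F_N


F_N = empirical_cdf([0.2, 0.6, 0.7])
for x in (0.1, 0.2, 0.5, 0.6, 0.65, 0.7, 0.9):
    print(x, F_N(x))
```

The printed values should match the four cases above: $0$ below $0.2$, $\frac{1}{3}$ on $[0.2,0.6)$, $\frac{2}{3}$ on $[0.6,0.7)$ and $1$ from $0.7$ onward.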

What the question asks you to calculate is $D_{N}$. (The hint about the first and the last term is supposed to help with that.) Let me try to help with the figure below (it's straightforward to draw by hand, since you are prevented from using code). The blue line is the theoretical cdf, the magenta line is the empirical cdf and the yellow line is the absolute difference between the two. Your $D_{N}$ is the $\sup$ of this yellow function. I think it is $0.3$, attained at $x=0.7$.

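[Figure: the theoretical cdf (blue), the empirical cdf (magenta) and their absolute difference (yellow) for the data $0.2$, $0.6$, $0.7$.]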
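If you want to reproduce the curve of absolute differences numerically rather than by drawing, a brute-force grid evaluation is enough. The following is a rough sketch, not a proper way to compute a supremum: the grid size of 10000 is an arbitrary assumption, and a grid can only approximate the $\sup$ (here it happens to contain the data points, so it finds the maximum).

```python
from bisect import bisect_right

data = sorted([0.2, 0.6, 0.7])
n = len(data)


def F_N(x):
    # Empirical cdf: fraction of data points that are <= x.
    return bisect_right(data, x) / n


def F(x):
    # Theoretical cdf of U(0,1), restricted to 0 <= x <= 1.
    return x


# Evaluate |F_N(x) - F(x)| on an evenly spaced grid over [0, 1].
grid = [i / 10000 for i in range(10001)]
gap, argmax = max((abs(F_N(x) - F(x)), x) for x in grid)

print(gap, argmax)  # should land near 0.3 at x = 0.7, up to rounding
```

Outside $[0,1]$ both cdfs agree (both are $0$ to the left and $1$ to the right), so restricting the grid to $[0,1]$ loses nothing.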

How to get at $D_{N}=0.3$? A more systematic approach is to calculate $D^{+}$ and $D^{-}$. You have $D^{+}=\max_{k=1,\ldots,N}\left(\frac{k}{N}-F(X_{k})\right)$ and $D^{-}=\max_{k=1,\ldots,N}\left(F(X_{k})-\frac{k-1}{N}\right)$. What are these? Both $D^{+}$ and $D^{-}$ focus on the difference between the empirical and the theoretical cdf at the data values (since this is where the difference is the largest). At each data point, the empirical cdf jumps up and hence has a "lower" value in the left neighborhood of the data point and a "higher" value in the right neighborhood of the data point. $D^{+}$ is the largest amount by which the higher value of the empirical cdf exceeds the theoretical cdf. $D^{-}$ is the largest amount by which the theoretical cdf exceeds the lower value of the empirical cdf. Loosely speaking, $D^{+}$ is the largest positive difference between the two cdfs (empirical above theoretical) and $D^{-}$ is the largest negative difference (empirical below theoretical).

With the help of the figure, substituting into the definition, one can get: $$\begin{aligned} D^{+}&=\max\{\tfrac{1}{3}-0.2,\tfrac{2}{3}-0.6,\tfrac{3}{3}-0.7\}\approx\max\{0.133,0.067,0.3\}=0.3\\ D^{-}&=\max\{0.2-\tfrac{0}{3},0.6-\tfrac{1}{3},0.7-\tfrac{2}{3}\}\approx\max\{0.2,0.267,0.033\}=0.267\\ \end{aligned}$$
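After doing the arithmetic by hand, you can check it with a direct transcription of the two formulas. This is a minimal sketch under the assumption that $F$ is continuous (as it is for $\mathcal{U}(0,1)$); the function name `ks_statistics` and its default argument are mine, not part of any library.

```python
def ks_statistics(sample, cdf=lambda x: x):
    """Return (D_plus, D_minus, D_N) for a sample and a continuous cdf.

    The default cdf is F(x) = x, i.e. U(0,1) for data inside [0, 1].
    """
    xs = sorted(sample)
    n = len(xs)
    # D+ : empirical cdf just after the jump (k/n) minus the theoretical cdf.
    d_plus = max(k / n - cdf(x) for k, x in enumerate(xs, start=1))
    # D- : theoretical cdf minus the empirical cdf just before the jump ((k-1)/n).
    d_minus = max(cdf(x) - (k - 1) / n for k, x in enumerate(xs, start=1))
    return d_plus, d_minus, max(d_plus, d_minus)


d_plus, d_minus, d_n = ks_statistics([0.2, 0.6, 0.7])
print(round(d_plus, 3), round(d_minus, 3), round(d_n, 3))  # 0.3 0.267 0.3
```

The largest of the two, $\max\{D^{+},D^{-}\}=0.3$, agrees with the value read off the figure at $x=0.7$.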