Maximizing the p-value associated with a goodness-of-fit test


Problem (concrete version): Let $p = p(T, \mathcal{D}, \mathbf{X})$ be the $p$-value associated with a goodness-of-fit test $T$ (e.g. Kolmogorov-Smirnov, Anderson-Darling, Cramér–von Mises) on distribution $\mathcal{D}$ (e.g. normal, beta, gamma) with sample points $\mathbf{X} = (X_1, \ldots, X_n)$. Suppose $T$ and $\mathcal{D}$ are known and $n$ is fixed. Is there a way to solve \begin{align*} \underset{\mathbf{X}}{\text{argmax}} \quad p(T, \mathcal{D}, \mathbf{X})? \end{align*}

Problem (more abstract version): Given a distribution $\mathcal{D}$, how can I select the $n$ points that are "most representative" of that distribution? The problem I am working on requires selecting points from such a distribution, not sampling them: the points cannot differ between experiments, hence we would like to get the most representative points the first time around. To make this concrete, I defined representativeness as the $p$-value associated with a goodness-of-fit test, but I'm certainly open to other ideas, especially ones with a simple analytical solution.

Examples: Let $T = $ Kolmogorov-Smirnov and $\mathcal{D} = \text{Unif}(0, 1)$. For $n = 1, 2, 3$, we may calculate in R:

> x = c(0.5)
> ks.test(x, "punif")

    One-sample Kolmogorov-Smirnov test

data:  x
D = 0.5, p-value = 1
alternative hypothesis: two-sided

> x = c(0.25, 0.75)
> ks.test(x, "punif")

    One-sample Kolmogorov-Smirnov test

data:  x
D = 0.25, p-value = 1
alternative hypothesis: two-sided

> x = c(1/6, 0.5, 5/6)
> ks.test(x, "punif")

    One-sample Kolmogorov-Smirnov test

data:  x
D = 0.16667, p-value = 1
alternative hypothesis: two-sided

Of course, this is a relatively easy example where I could guess the maximizing arguments. More generally, that is not the case. We could perform a grid search, but the dimension of the grid grows with $n$, and the problem is complicated further if $\mathcal{D}$ is itself a multivariate distribution.
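Interestingly, the examples above all follow the pattern $X_i = F^{-1}\big((i - \tfrac12)/n\big)$: the midpoint quantiles, which appear to make the empirical CDF straddle the theoretical CDF evenly and yield the smallest possible K-S statistic $D = 1/(2n)$. A sketch in R (the helper name ks_points is my own, not a standard function):

```r
# Midpoint quantiles (i - 0.5)/n of the target distribution; these give
# the minimum possible one-sample K-S statistic D = 1/(2n).
# 'ks_points' is an illustrative helper name, not a standard function.
ks_points = function(n, qfun = qunif, ...) qfun((seq_len(n) - 0.5) / n, ...)

ks_points(3)                                      # 1/6, 1/2, 5/6, as above
ks.test(ks_points(3), "punif")$p.value            ## 1
ks.test(ks_points(20, qnorm), "pnorm")$statistic  # D = 1/(2*20) = 0.025
```

This covers any univariate $\mathcal{D}$ with an available quantile function, though it says nothing about the multivariate case.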

Thanks!

BEST ANSWER

Comments:

If you are going to explore with ks.test in R for sample sizes of 100 or greater, you should use exact=TRUE; otherwise you will get an approximate P-value that may be misleading.
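To see the difference side by side (a quick sketch; for $n \ge 100$, or in the presence of ties, ks.test falls back to the asymptotic approximation unless told otherwise):

```r
# Compare the default (asymptotic, for n >= 100) and exact K-S P-values
set.seed(1)
x = runif(100)
ks.test(x, "punif")$p.value                # asymptotic approximation
ks.test(x, "punif", exact=TRUE)$p.value    # exact P-value
```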

You are probably aware that for random data under $H_0$, the P-value of a continuous test statistic is distributed $\text{Unif}(0,1)$. Ordinarily, one rejects for P-values less than 5%. But if one is checking the validity of a (pseudo)random number generator, or of the programming used to produce random variables (other than uniform), then one should be equally suspicious of P-values that are too large, because they indicate 'random' events that are too good to be true. [For example, in a chi-squared goodness-of-fit test for fairness of a die, if the counts of the respective spots 1 through 6 reported after 600 (alleged) rolls are precisely $(100, 100, 100, 100, 100, 100)$, one would be suspicious.]
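The die example is easy to verify (by default, chisq.test on a single vector of counts tests equal cell probabilities):

```r
# Suspiciously perfect die counts: chi-squared statistic 0, P-value 1
chisq.test(rep(100, 6))   # X-squared = 0, df = 5, p-value = 1
```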

Below is a histogram of P-values of K-S statistics from tests on 100,000 simulated exponential samples of size $n = 100$.

set.seed(1776);  m = 10^5;  pv = numeric(m)
for(i in 1:m) {
   x = rexp(100)  # exponential data, n = 100
   pv[i] = ks.test(x, "pexp", alternative="t", exact=TRUE)$p.value }
hist(pv, prob=T, xlim=c(-.1,1.1), col="skyblue2")
  curve(dunif(x), -.2, 1.2, n=10001, lwd=2, col="blue", add=T)
  abline(v=.05, lwd=2, lty="dashed", col="red")

[Histogram of the 100,000 K-S P-values: approximately Unif(0,1), with the Unif(0,1) density overlaid in blue and a dashed red line at 0.05.]

So you might want to ponder the motivation underlying your quest for large P-values. They do not necessarily indicate 'best' fits in any kind of practical sense. Nevertheless, ...

x = seq(0, 1, length=100)
ks.test(x, "punif", exact=TRUE)$p.value
## 1
y = -log(x)   # inverse-CDF transform of UNIF yields EXP
ks.test(y, "pexp", exact=TRUE)$p.value
## 1
w = x^2       # square of UNIF yields BETA(.5, 1)
ks.test(w, "pbeta", .5, 1, exact=TRUE)$p.value
## 1
z = qnorm(x)  # quantile transform of UNIF yields standard normal
ks.test(z, "pnorm", exact=TRUE)$p.value
## 1

More generally: The first step of a K-S test is the probability-integral transform to $\text{Unif}(0,1)$, so the quantile (inverse-CDF) transform of a dataset with a 'perfect' fit to $\text{Unif}(0,1)$ will also show a perfect fit to the target distribution, which can be almost any distribution you like.
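For instance, with a Gamma target (a sketch; I use the interior midpoint grid $(i - \tfrac12)/n$ rather than seq(0, 1, ...), since qgamma sends the endpoints 0 and 1 to 0 and Inf):

```r
# Quantile-transform a 'perfect' uniform grid into a Gamma(shape = 2) sample;
# midpoints (i - 0.5)/n keep qgamma away from the endpoints 0 and 1
u = (1:100 - 0.5) / 100
g = qgamma(u, shape = 2)
ks.test(g, "pgamma", 2, exact=TRUE)$p.value
## 1
```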