How to tell if two samples come from the same probability distribution?


I have two distributions (generated from binned data) and wish to answer the question: Do they come from the same underlying distribution? I don't have the form of the underlying distribution.

Edit: Really I have ~ 30,000 samples (of some physical quantity, say v) distributed in some large physical space. The distribution I care about is in some very small subset of this physical space that has essentially zero sample points. I'd like to characterize the variability of the v distribution in physical space (or on different physical scales).

Most people use the distribution in the large space. But since I have so much data, I've broken the large space down into about 700 smaller, overlapping spaces. I want to know whether these 700 small-space distributions (~3000 samples each) are drawn from a common underlying distribution and, if so, whether it is the same underlying distribution that the large space's distribution is drawn from.

Part of my worry is that these samples are not independent: the smaller-space distributions share some sample points with one another, and in the most extreme case their sample points are a subset of the large-space distribution.


There are 2 best solutions below


Pearson's chi-squared test was invented exactly for this kind of task.
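As a rough sketch of what that could look like in R for binned data: the two simulated samples, the sample sizes and the bin edges in `breaks` below are made up for illustration and are not from the question. The binned counts are stacked into a 2 x k table and passed to `chisq.test`, which then tests whether both rows share the same bin probabilities.

```r
# Illustrative data only: two simulated samples on [0, 1].
set.seed(42)
x <- runif(300)
y <- rbeta(300, 1, 2)

# Bin both samples on the same (assumed) bin edges.
breaks <- seq(0, 1, by = 0.2)
counts <- rbind(
  x = table(cut(x, breaks, include.lowest = TRUE)),
  y = table(cut(y, breaks, include.lowest = TRUE))
)

# Pearson's chi-squared test of homogeneity on the 2 x 5 table of counts.
result <- chisq.test(counts)
print(counts)
print(result)
```

Note that the test assumes the two samples are independent, so it does not directly cover the case where one sample is a subset of the other or where the samples share points.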

0
On

The best idea is probably to use probability plotting, provided there are enough observations (at least 20-30, say; with too few observations any test is doomed). As for the commenter mentioning the Kolmogorov-Smirnov test: that is mostly a nice theoretical idea, which does not work well in practice. It is simply not powerful enough.

Below I show how you can do this in R. First an example where we simulate two samples from the same distribution:

> x <-  runif(100)
> y <-  runif(150)
> qqplot(x, y)

(Yes, there is no need to assume equal sample sizes!)

Figure: probability plot of two independent samples from the uniform distribution

Then we can simulate from two different distributions:

> x <-  runif(100)
> y <-  rbeta(100, 1, 2)
> qqplot(x, y)
> abline(0, 1, col="red2")

Figure: probability plot of two independent samples, from the uniform and beta distributions

As you can see from the last plot, the deviation from the straight line is quite clear. If you need a formal test, you can use the correlation coefficient calculated from the points of the plot. Cutoffs for significance tests can be found by simulation! I am quite sure that would give a more powerful test than Kolmogorov-Smirnov.
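One way the simulation idea could be sketched is as a permutation test. This is an assumption about how to get the cutoffs, not something the answer prescribes, and the helper names `qq_cor` and `qq_cor_test` are made up here. Under the null hypothesis the pooled observations are exchangeable, so reshuffling them between the two samples gives a reference distribution for the QQ-plot correlation.

```r
# Correlation between the matched quantiles that qqplot() would draw.
qq_cor <- function(x, y) {
  qq <- qqplot(x, y, plot.it = FALSE)
  cor(qq$x, qq$y)
}

# Permutation test (hypothetical helper): reshuffle the pooled data
# many times and see how often the correlation is as low as observed.
qq_cor_test <- function(x, y, n_perm = 2000) {
  observed <- qq_cor(x, y)
  pooled <- c(x, y)
  n_x <- length(x)
  null_cors <- replicate(n_perm, {
    idx <- sample(length(pooled), n_x)
    qq_cor(pooled[idx], pooled[-idx])
  })
  # A low correlation is evidence against a common distribution.
  p_value <- (1 + sum(null_cors <= observed)) / (n_perm + 1)
  list(statistic = observed, p.value = p_value)
}

set.seed(1)
x <- runif(100)
y <- rbeta(100, 1, 2)
res <- qq_cor_test(x, y)
print(res)
```

Two caveats: the correlation does not change under shifts or rescalings of either sample, so this statistic only picks up differences in shape, and the permutation argument assumes the two samples are independent of each other.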