Testing whether a Sample is from a known Population


I was wondering if I could ask a question to the Stats Folk.

Consider a random variable $Y$ whose $E[Y]$ and $V[Y]$ are known.

Now consider that a set of sample data $X$ is collected; it has its own sample mean $E[X]$, sample variance $V[X]$, and sample size $n$.

My question: is there a statistical test to determine whether $X$ is a true sample of $Y$? Or some sort of assessment of this?

We very often see analyses in which a sample is analysed under the assumption that it comes from a certain population, but I've been unable to find any that test whether $X \subseteq Y$.


There are 2 best solutions below


If you know not only the mean and variance but also the true theoretical distribution of $Y$, you can compare it with the empirical distribution of $X$ using the Kolmogorov-Smirnov test. This test measures how close the two distributions are to each other, using the largest vertical distance between their CDFs.
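A minimal sketch in R of this one-sample test. The question does not say what the distribution of $Y$ is, so the exponential distribution with rate 1 is an assumption made purely for illustration, and `x` is simulated stand-in data rather than a real sample.

```r
# Hypothetical setup: assume Y ~ Exponential(rate = 1).
# x stands in for the collected sample; here it is simulated.
set.seed(2024)
x <- rexp(200, rate = 1)

# One-sample K-S test: compares the ECDF of x with the
# theoretical CDF pexp(., rate = 1).
res <- ks.test(x, "pexp", rate = 1)

res$statistic  # D: largest vertical gap between the ECDF and the theoretical CDF
res$p.value    # a small value is evidence against "x was drawn from Y"
```

The parameters of the reference distribution should be known in advance, as in the question; if they are estimated from the same sample, the usual K-S P-value is no longer valid.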


Here are summary statistics for three samples, each of size $n = 500,$ sampled from three different distributions.

summary(a); sd(a)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-2.3961  0.3281  0.9793  1.0018  1.6610  4.1959 
[1] 1.034814     # Std Devn: a
summary(b); sd(b)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.001137 0.315328 0.747986 0.993007 1.410341 6.023057 
[1] 0.9158659    # Std Devn: b
summary(c); sd(c)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.003384 0.518120 1.037977 1.009476 1.490161 1.997519 
[1] 0.5622965    # Std Devn: c

Now I have another sample x of size 200, and I wonder if it might have come from the same distribution as one of a, b, or c.

summary(x);  sd(x)
   Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.001541 0.311374 0.771930 1.120292 1.642017 5.614681 
[1] 1.089492   # Std devn x

There are some similarities among the samples: all four have means near 1. Samples b, c, and x take only positive values. Samples a, b, and x have standard deviations near 1. So just from the usual descriptive statistics, x seems similar to b, but not especially similar to a or c.

Trying to match histograms of various samples of intermediate sizes often does not help solve puzzles like this, because the binning can produce misleading effects.

However, comparing empirical CDFs can sometimes be more fruitful. To make the ECDF of a sample: sort observations in order, and make a stairstep function that increases by $1/n$ at each sorted observation (jumps of $2/n$ or $3/n$, etc. in case there are ties). So the ECDF starts at 0 at the left edge of the graph and reaches 1 at the right edge. (The ECDF of a sufficiently large sample approximates the CDF of the distribution from which it was sampled.)
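The construction just described can be sketched in R as follows. The five-value vector `obs` is a made-up example with one tie; it is not one of the samples above. The manual step heights are checked against R's built-in `ecdf()`.

```r
# Small illustrative sample with one tie (the value 2 appears twice).
obs <- c(3, 1, 2, 2, 5)
n <- length(obs)

# Manual construction: at each distinct sorted value the step
# function jumps by (number of observations at that value) / n.
vals <- sort(unique(obs))
heights <- cumsum(table(obs)) / n   # ECDF height just after each jump

# Built-in equivalent: ecdf() returns the step function itself.
Fn <- ecdf(obs)

# The two constructions agree at every jump point.
all.equal(as.numeric(heights), Fn(vals))
```

Here the jump at 2 has size $2/n$ because of the tie, and the function reaches 1 at the largest observation.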

Here are ECDFs of samples a, b, and c. The ECDF of sample x is superimposed on each of these ECDFs to see if there are any likely matches.

[Figure: ECDFs of samples a, b, and c, each with the ECDF of x superimposed]

Pretty clearly x does not match a or c, but it seems a possible match to b.
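One way such a comparison plot could be drawn in base R is sketched below. The layout, colours, and titles are my own choices and not necessarily those of the original figure; the samples are regenerated with the simulation settings given at the end of this answer, stored in a list called `samples` (a name introduced here for convenience).

```r
# Regenerate the samples (same seed and draw order as the simulation code).
set.seed(1234)
samples <- list(
  a = round(rnorm(500, 1, 1), 6),
  b = round(rexp(500, 1), 6),
  c = round(runif(500, 0, 2), 6)
)
x <- round(rexp(200, 1), 6)

# Three panels side by side: ECDF of each sample, with the ECDF of x in red.
op <- par(mfrow = c(1, 3))
for (nm in names(samples)) {
  plot(ecdf(samples[[nm]]), main = paste("ECDF of", nm, "with x"),
       xlab = "value", ylab = "ECDF",
       do.points = FALSE, verticals = TRUE)
  lines(ecdf(x), col = "red", do.points = FALSE, verticals = TRUE)
}
par(op)  # restore the previous graphics settings
```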

As suggested by @Andrew (+1), we use Kolmogorov-Smirnov tests as formal tests of a match. A P-value above 0.05 means the test does not reject the hypothesis that the two samples come from the same distribution, so a match is possible; a P-value below 0.05 is evidence against a match. Below I use $-notation to show only the P-values, and not other details of the K-S tests.

ks.test(a,x)$p.val   
[1] 0.001911254      # no match
ks.test(b,x)$p.val
[1] 0.5045157        # possible match
ks.test(c,x)$p.val
[1] 0.0003169227     # no match

In practice, with data from real-world situations, there is no way to know for sure that two samples come from exactly the same population, but the K-S test is a good way of quantifying the agreement of ECDFs, which is one of the best ways to judge possible matches.
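To make "quantifying the agreement of ECDFs" concrete: the two-sample K-S statistic $D$ is the largest vertical gap between the two ECDFs. The sketch below computes it by hand and compares it with the value reported by `ks.test`. The samples `u` and `v` are fresh simulated data introduced only for this illustration; they are not the samples a, b, c, or x.

```r
# Two hypothetical samples from the same exponential distribution.
set.seed(42)
u <- rexp(500, 1)
v <- rexp(200, 1)

Fu <- ecdf(u)
Fv <- ecdf(v)

# The gap between two step functions can only change at an observed
# value, so it is enough to evaluate it on the pooled data.
grid <- sort(c(u, v))
D_manual <- max(abs(Fu(grid) - Fv(grid)))

# The statistic reported by the two-sample K-S test.
D_ks <- unname(ks.test(u, v)$statistic)

all.equal(D_manual, D_ks)
```

The P-value of the test is then the probability, under the hypothesis of a common distribution, of seeing a gap at least as large as $D$.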


Note: In this particular example, we can know that samples b and x came from the same population because the data are simulated.

The four samples were simulated in R, using the code below:

set.seed(1234)
a = round(rnorm(500, 1,  1), 6)
b = round(rexp(500, 1), 6)
c = round(runif(500, 0, 2), 6)
x = round(rexp(200, 1), 6)