Arguing that the distribution is not normal


Assume someone claims that they sampled i.i.d. $x_1, \ldots, x_k$ (let's say $k=10$) from a standard normal distribution $\mathcal{N}(0, 1)$. They claim that the sampled values are exactly $x_1=x_2= \dots = x_k = 0$. My reaction would be "you are lying", i.e. there is no way in hell $x_1, \ldots, x_k$ would all be exactly the same infinite-precision number. Am I correct, and how do I show this formally?

Intuitively, I want to say that the event of sampling the same number $10$ times has probability $0$. The problem is that I can make this argument for any sequence, since the probability of getting any fixed sequence is $0$. In fact, the sequence $x_1=\cdots=x_k=0$ maximizes $p(x_1) \cdot p(x_2) \cdots p(x_k)$, where $p$ is the PDF (which also doesn't seem to imply anything).

What is the formal argument? I'm stuck on even formulating the exact mathematical statement I want to make (as I explained above, "the probability of getting this sequence is $0$" doesn't imply anything). While normality tests let me upper-bound the probability of this happening by some small positive number, I want to show that the probability of this happening is $0$.

I think this should be a standard question; if so, a reference would be appreciated.

EDIT after the answer was posted: To clarify, I would like an argument that captures the following case: assume that we sampled $k=2$ points and that $x_2 - x_1$ is rational. I want to say that this event still has probability $0$. I can indeed say that "the probability of the distance being rational is $0$". The problem is the same as before: for any fixed difference, I can similarly say that the probability of observing exactly that difference is $0$.
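One way to make the distinction precise is countable additivity: the rational-distance event is a countable union of probability-zero events, so it has probability zero, while "some difference occurs" is an uncountable union, to which no such bound applies. A sketch:

```latex
% The rational-distance event is a countable union of null events:
\[
  \Pr\bigl[X_2 - X_1 \in \mathbb{Q}\bigr]
  = \Pr\Bigl[\,\bigcup_{q \in \mathbb{Q}} \{X_2 - X_1 = q\}\Bigr]
  \le \sum_{q \in \mathbb{Q}} \Pr\bigl[X_2 - X_1 = q\bigr]
  = 0,
\]
% since $X_2 - X_1 \sim \mathcal{N}(0, 2)$ is continuous, each summand is $0$,
% and the sum ranges over a countable index set. No such bound exists for the
% uncountable union $\bigcup_{x \in \mathbb{R}} \{X_1 = x\}$, which is the
% whole sample space and has probability $1$.
```

This captures why "rational distance" is rejectable in a way that "exactly this particular irrational distance" is not: the former is a single pre-specified null event, the latter is one member of an uncountable family that exhausts all outcomes.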


Accepted answer (score 3)

You yourself just explained why no formal argument is available to prove what you want to prove:

Assume someone claims that they sampled i.i.d. $x_1, \ldots, x_k$ (let's say $k=10$) from a standard normal distribution $\mathcal{N}(0, 1)$. They claim that the sampled values are exactly $x_1=x_2= \dots = x_k = 0$. My reaction would be "you are lying", i.e. there is no way in hell $x_1, \ldots, x_k$ would all be exactly the same infinite-precision number. Am I correct, and how do I show this formally?

Intuitively, I want to say that the event of sampling the same number $10$ times has probability $0$. The problem is that I can make this argument for any sequence, since the probability of getting any fixed sequence is $0$. In fact, the sequence $x_1=\cdots=x_k=0$ maximizes $p(x_1) \cdot p(x_2) \cdots p(x_k)$, where $p$ is the PDF (which also doesn't seem to imply anything).

Because (as specified in your comments to @lulu) you are working with a continuous distribution and an idealized, infinite-precision sampling process, every particular outcome $\Bbb{R}^k \ni \vec{x} := (x_1, x_2, \ldots, x_k)$ has probability zero.
(In fact, $\vec{0} := (0, 0, \ldots, 0)$ is the mode of the probability density function of a normally distributed $k$-dimensional random vector $\vec{X_k} \sim \mathcal{N}(\vec{0}, I_k)$, so in some sense this is the outcome we should doubt least!)
Even if you want to say that the probability of $\vec{X_k}$ lying in some hyperplane $x_n = x_m$ is $0$, which is certainly true, (a) that doesn't establish impossibility and (b) every potential outcome lies in infinitely many hyperplanes, all of which individually have probability zero.
This is why @lulu pointed you in the direction of finite precision, hypothesis testing, and significance levels: only an actually impossible result (not merely a probability-zero one) could totally rule out the observed data having some particular continuous distribution.
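The finite-precision route can be made quantitative. A minimal sketch, assuming the claimed values were reported at some precision $\epsilon$ (the choice $\epsilon = 10^{-6}$ below is hypothetical):

```python
from scipy.stats import norm

# Probability that a single N(0,1) draw rounds to 0 at precision eps,
# i.e. falls in the interval [-eps/2, eps/2].
eps = 1e-6          # hypothetical reporting precision
k = 10              # number of i.i.d. draws
p_single = norm.cdf(eps / 2) - norm.cdf(-eps / 2)

# By independence, the probability that all k draws round to 0:
p_all = p_single ** k

print(p_single)     # ~ 4.0e-07, roughly eps * pdf(0) = eps / sqrt(2*pi)
print(p_all)        # ~ 1.0e-64: tiny, but strictly positive
```

Unlike the exact-value event, this rounded event has positive probability, so it supports an ordinary significance-test rejection rather than a (vacuous) probability-zero argument.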

Answer (score 2)

The sequence $(0, \dotsc, 0)$ has the greatest product of probability densities of any sequence, but its probability is still $0$, just like that of every other sequence. That said, if we did sample $k$ exact values from a normal distribution and all of them were exactly $0$, you would be right to suspect that something is wrong with the sampling mechanism.

A sequence of $k$ zeros would be far more surprising than, say, a sequence of $k$ irrational numbers with no discernible pattern. However, the reason for this has nothing to do with probability: as humans, we see $0$ as a very special, commonly encountered number, so its appearance strikes us as meaningful even though it means nothing special to a random variable sampled from a normal distribution.

Answer (score 13)

Let $X_1, X_2, \ldots, X_k$ be jointly independent and identically distributed random variables. Suppose that, after performing some experiment, we obtain a sample $(x_1, x_2, \ldots, x_k)$ of $k$ results such that $X_i = x_i$ for $i = 1, 2, \ldots, k$.

Even if some or all results in the sample are the same exact value, we can perform one of several statistical tests to assess whether the underlying distribution of the $X_i$ is indeed normal (though note that tests whose statistics divide by the sample variance, such as Shapiro-Wilk, are undefined when every observation is identical):

  • The Shapiro-Wilk Test tests the null hypothesis that the results $x_1,x_2,\ldots,x_k$ were sampled from a normally distributed population. If the p-value of the test statistic is below your chosen significance level $\alpha$, you may reject the null hypothesis and conclude the data did not originate from a normal distribution (accepting a Type I error rate of $\alpha$). This test performs quite well relative to the others below, especially for small to moderate sample sizes.
  • The (One-Sample) Kolmogorov-Smirnov (K-S) Test tests the null hypothesis that the results $x_1,x_2,\ldots,x_k$ were sampled from a given reference distribution (normal or otherwise). It does this by quantifying the difference between the empirical cumulative distribution function of the observed data and the cumulative distribution function of the given reference distribution. This test typically requires larger sample sizes to perform as well as the others on this list.
  • The Anderson-Darling Test tests the null hypothesis that the results $x_1,x_2,\ldots,x_k$ were sampled from a given reference distribution (normal or otherwise). It works similarly to the One-Sample K-S Test, but it places more weight on the tails of the given distribution, making it particularly sensitive to departures from normality in the tails.
  • D'Agostino's $K^2$ Test tests the null hypothesis that the results $x_1,x_2,\ldots,x_k$ were sampled from a normally distributed population. It combines the sample skewness and kurtosis to detect departures from the values expected under normality.
  • The Lilliefors Test tests the null hypothesis that the results $x_1,x_2,\ldots,x_k$ were sampled from a normally distributed population whose mean and variance are unknown and must be estimated from the data. Like the One-Sample K-S Test, its test statistic is the maximum difference between the empirical distribution function of the observed data and the (fitted) normal cumulative distribution function; unlike the One-Sample K-S Test, however, it corrects for the fact that the parameters were estimated, making it better suited to smaller sample sizes.

Again, any of the above tests yields a p-value: the probability, under the null hypothesis of normality, of observing data at least as extreme as $x_1,x_2,\ldots,x_k$. This holds even if some or all of the samples happen to be the same value.
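As a concrete illustration, the One-Sample K-S Test can be run directly on the all-zeros sample with `scipy.stats.kstest` (chosen here because, unlike Shapiro-Wilk, its statistic remains well defined for constant data):

```python
import numpy as np
from scipy.stats import kstest

# The suspicious sample: k = 10 exact zeros.
x = np.zeros(10)

# One-sample K-S test against the standard normal CDF.
result = kstest(x, 'norm')

# The empirical CDF jumps from 0 to 1 at x = 0, while Phi(0) = 0.5,
# so the K-S statistic is exactly 0.5.
print(result.statistic)   # 0.5
print(result.pvalue)      # well below 0.05, so normality is rejected
```

Note that the test rejects not because the values are zero per se, but because ten coincident points are maximally far (in sup-norm) from any continuous CDF centered on them.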

You are perfectly correct in presuming the results $x_1 = x_2 = \ldots = x_k = 0$ are suspicious given the claim that they were independently sampled from a standard normal distribution, and any one of the above tests would back you up on that, especially for large $k$. The reason is that, for i.i.d. draws from any continuous distribution, the sample variance satisfies $s^2 > 0$ with probability one whenever $k \ge 2$; hence we expect there to be variation in the observed results.
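The variance argument can also be quantified: for i.i.d. $\mathcal{N}(0,1)$ draws, $(k-1)s^2$ follows a $\chi^2$ distribution with $k-1$ degrees of freedom, so the probability of a near-zero sample variance can be computed exactly. A sketch (the threshold $10^{-12}$ is an arbitrary stand-in for "numerically zero"):

```python
from scipy.stats import chi2

k = 10
threshold = 1e-12   # hypothetical cutoff for an "essentially zero" variance

# For sigma^2 = 1, (k - 1) * s^2 ~ chi-squared with k - 1 degrees of
# freedom, so P(s^2 <= t) = chi2.cdf((k - 1) * t, df = k - 1).
p = chi2.cdf((k - 1) * threshold, df=k - 1)

print(p)   # astronomically small; s^2 = 0 itself has probability exactly 0
```

This makes the closing point precise: an exactly constant sample sits at an event of probability zero, and even "constant to within measurement precision" has vanishingly small probability under the claimed model.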