Is it possible for most (all?) items in a list to lie within one standard deviation of the mean of that list?

443 Views Asked by At

Question. Is it possible for most (all?) items of a list to lie within one standard deviation of the mean of that list?

Motivation. Fast computations of approximate medians of lists of integers is considered an important problem in computer science. My personal interest in the problem comes from the fact that if a quicksort implementation uses a good approximations to the median for choosing its pivot elements, the worst-case time complexity will be $O(n \log n)$, which is considered fast.

With that motivation established, recall that the nonparametric skew of a distribution is defined as $$\frac{\mu - \nu}{\sigma},$$ where $\mu$ is the mean, $\nu$ is the median, and $\sigma$ is the standard deviation. According to the relevant wikipedia page:

The nonparametric skew ... lies between $-1$ and $+1$ for any distribution

Obviously, this applies to lists as well, and not just probability distributions. Therefore, I'm interested in using this to quickly find the median of a list $L$ of not-necessarily positive integers. Here's the idea: if the standard deviation of $L$ is zero, the first item of the list is the median. Otherwise, we can omit elements that aren't one standard deviation of the mean, recalculate $\mu$ and $\sigma$, and repeat the process. A related idea is to stop after a certain portion of the list items have been omitted by this process, in the hope that a randomly chosen element of the remaining elements will provide a good approximation to the median.

Here's the problem: the process of omitting elements needs to be guaranteed to omit a decent portion of the list at each iteration. Otherwise, the worst-case time complexity of this strategy will be very bad. Indeed, if there exist lists in which all the points are within one standard deviation of the mean, the aforementioned algorithm will never terminate for those lists.

2

There are 2 best solutions below

5
On BEST ANSWER

In a degenerate case, it's possible: in a list of integers which has only two values $a$ and $b$, appearing equally often, the mean will be $\frac{a+b}{2}$ and the (population) standard deviation will be $\frac{|a-b|}{2}$, and both values will be within a standard deviation of the mean.

We can prove that this is the only situation in which this happens (for a probability distribution). If $\Pr[|X-\mu| \le \sigma] = 1$, then $\Pr[(X-\mu)^2 \le \sigma^2] = 1$, so the random variable $(X-\mu)^2$ must almost surely be at most its expected value: $\sigma^2$, by definition. This can only happen if it's almost surely equal to its expected value $\sigma^2$, so we must have $X \in \{\mu-\sigma, \mu+\sigma\}$ with probability $1$.


More concerning are situations where most values lie within a standard deviation of the mean, which can happen easily when the data has extreme outliers. We might hope that in the outlined median-finding algorithm, one phase will get rid of outliers, and then the next phase will work much better. But I can imagine scenarios where that won't happen.

For example, consider the following data set of $2n+1$ elements, with mean and median $0$: $$ -2^{2^n}, -2^{2^{n-1}}, -2^{2^{n-2}}, \dots, -2^{2^1}, 0, 2^{2^1}, \dots, 2^{2^{n-2}}, 2^{2^{n-1}}, 2^{2^n}. $$ (Here, $f(x) = 2^{2^x}$ is just one possible extremely fast-growing function.)

The extreme values $\pm 2^{2^n}$ are so large that they make the standard deviation extremely large, which means that all other values fall within a standard deviation of the mean. So the algorithm will drop those two extremes from our data set, and leave the other $2n-1$ values. But then, we're on the $n-1$ case of the same pattern, and the same behavior happens again; it takes $O(n)$ phases to identify the median of $0$.

Whether this is likely or not depends on the kind of distribution your numbers come from.

2
On

Is it possible for most (all?) items of a list to lie within one standard deviation of the mean of that list?

It is absolutely possible for most items of a list to be within $\sigma$ of the mean. How likely that is depends on what you mean by "most". Details are a function of the discrete distribution your list represents, but if, say, it approximates a sample from a normal distribution then you would expect about two thirds of the list elements to be within $\sigma$ of the mean. There are plenty of other ways in which the same proportion or more of the list elements could be within $\sigma$ of the mean.

To gauge the potential power of your approach, it might also be useful to look at how much you could expect to reduce the search space for the median when the data have other kinds of distributions. In particular, if the distribution of your elements is approximately uniform, then you can still expect more than half of them (around 58%) to be within $\sigma$ of the mean.