How to find the height of the average person that is taller than $n$ people?

Assume we have a population $X$ following a normal distribution.

We pick a random person $x_p$ and fix them. We then pick a random person $x_1$. The probability that $x_p > x_1$ is 0.5.

We pick yet another person $x_2$ and once again $P(x_p > x_2) = 0.5$.

Thus the probability that the picked person is taller than both of the other people is $0.5^2$ and, in general, for $n$ people it is $0.5^n$.

This is the probability of being taller than $n$ people, labelled $p_n$. Now I am interested in finding the height $h_n$ of the average person who is taller than $n$ people. In other words, $p_n$ tells you how likely you are to be taller than $n$ people if you don't know your height. We now want to know what your height is likely to be if you are indeed taller than $n$ people (i.e. you are taller than $x$ cm).

Edit:

To the people saying the probabilities are dependent, I don't think that's correct. Assume the height of someone is $\mu$, the average of the population. If you sample one person at random, the probability that that person is taller than $\mu$ is 50% (by definition). If you do this whole process again, the probability is still 50%. Trivially, if you pick $n$ people at random, the probability that all of them are taller than $\mu$ is $0.5^n$.

Calculating the probability of picking the tallest person among $n$ people is slightly different from asking what the probability is that someone with unknown height is taller than $n-1$ other people.

There are 2 best solutions below

Best answer

The issue here is whether you treat $x_p$ as a random variable, or as a fixed realization. Your treatment is not clear or consistent in the description of your question, thus the confusion.

When you write $$\Pr[x_p > x_1],$$ you are writing an unconditional probability of an event on two iid random variables $x_p$ and $x_1$. The resulting value will be a function of the parameters from which the distribution is drawn, but not of the random variables $x_p$ and $x_1$. But if you had written instead $$\Pr[x_p > x_1 \mid x_p],$$ then this is a conditional probability in which the answer will in general be a function of $x_p$. You seem to conflate the meaning of these two.

In particular, in the first case if $x_1, x_p$ are iid random variables following a $\operatorname{Normal}(\mu, \sigma^2)$ distribution, then their difference is $$x_p - x_1 \sim \operatorname{Normal}(0, 2\sigma^2).$$ Consequently, $$\Pr[x_p > x_1] = \Pr[x_p - x_1 > 0] = \frac{1}{2},$$ trivially. However, in the second case, $$\Pr[x_p > x_1 \mid x_p] = \Phi\left(\frac{x_p - \mu}{\sigma}\right),$$ where $\Phi$ is the CDF of the standard normal distribution. If by chance we had observed $x_p = \mu$, then $\Phi(0) = 1/2$, but otherwise, this probability is a strictly increasing function of $x_p$.
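To make the conditional case concrete, here is a minimal sketch that evaluates $\Phi\left(\frac{x_p - \mu}{\sigma}\right)$ for a few fixed heights. The function name and the parameter values ($\mu = 170$ cm, $\sigma = 10$ cm) are illustrative assumptions, not values from the question.

```python
import math


def prob_taller_given_height(x_p, mu, sigma):
    """Pr[x_p > x_1 | x_p] when x_1 ~ Normal(mu, sigma^2).

    This is the standard normal CDF evaluated at the z-score of x_p.
    """
    z = (x_p - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Illustrative parameters (assumed for this sketch, heights in cm).
MU, SIGMA = 170.0, 10.0

for height in (150.0, 160.0, 170.0, 180.0, 190.0):
    print(height, round(prob_taller_given_height(height, MU, SIGMA), 4))
```

Only the height equal to $\mu$ gives exactly $1/2$; every other fixed height gives a different value, which is why the conditional and unconditional statements should not be mixed.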

Now we can begin to see what's going on. If you asked for $$\Pr[(x_p > x_1) \cap (x_p > x_2)],$$ this is not equivalent to $$\Pr[x_p > x_1] \Pr[x_p > x_2]$$ because the event $(x_p > x_1)$ is not independent of the event $(x_p > x_2)$ when $x_1, x_2, x_p$ are iid. If you are not convinced, consider the explicit calculation for the simple case $\mu = 0$, $\sigma^2 = 1$: $$\Pr[(x_p > x_1) \cap (x_p > x_2)] = \int_{y_1 = -\infty}^\infty \int_{y_2 = -\infty}^\infty \!\!\!\! \Pr[(x_p > y_1) \cap (x_p > y_2)] f(y_1)f(y_2) \, dy_2 \, dy_1$$ where $f$ is the PDF of the standard normal. I will not show all the steps of the computation, but the answer to this is $1/3$; e.g. via Mathematica

Integrate[(1 - CDF[NormalDistribution[0, 1], Max[y1, y2]]) 
   PDF[NormalDistribution[0, 1], y1] PDF[NormalDistribution[0, 1], y2],
   {y2, -Infinity, Infinity}, {y1, -Infinity, Infinity}]

But whatever the exact value, it should have been obvious from the outset that independence of these events is not implied by independence of the sample. Moreover, you can immediately see that $$\Pr[(x_p > y_1) \cap (x_p > y_2)] = \Pr[x_p > \color{red}{\max(y_1, y_2)}],$$ not $$\Pr[x_p > y_1]\Pr[x_p > y_2].$$ This clearly demonstrates the lack of independence of the events.
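If you would rather check this numerically than symbolically, a rough Monte Carlo sketch along the following lines should do. The function name, trial count and seed are arbitrary choices made for this illustration.

```python
import random


def estimate_probabilities(trials=200_000, seed=0):
    """Estimate Pr[x_p > x_1], Pr[x_p > x_2] and Pr[both] for iid standard normals."""
    rng = random.Random(seed)
    beats_first = beats_second = beats_both = 0
    for _ in range(trials):
        x_p, x_1, x_2 = (rng.gauss(0.0, 1.0) for _ in range(3))
        a = x_p > x_1
        b = x_p > x_2
        beats_first += a
        beats_second += b
        beats_both += a and b
    return beats_first / trials, beats_second / trials, beats_both / trials


p1, p2, p_both = estimate_probabilities()
print("Pr[x_p > x_1]            ~", p1)
print("Pr[x_p > x_2]            ~", p2)
print("product of the two       ~", p1 * p2)
print("Pr[both at the same time] ~", p_both)
```

The two marginal estimates should each land near $1/2$ and their product near $1/4$, while the joint estimate should sit near $1/3$, up to sampling noise.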

However, if we were to consider the conditional calculation, i.e. $$\Pr\left[\bigcap_{i=1}^n (x_p > x_i) \mid x_p \right],$$ it is correct that independence applies because $x_p$ is no longer random: you've conditioned on it, so the above is equivalent to $$\prod_{i=1}^n \Pr[x_p > x_i \mid x_p].$$
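As a sketch of how the two viewpoints fit together, take the event from the question, $x_p > x_i$ for every $i$, and write $z = (x_p - \mu)/\sigma$ (a shorthand introduced here), with $f$ the standard normal PDF as before. Each conditional factor equals $\Phi(z)$, so $$\Pr\left[\bigcap_{i=1}^n (x_p > x_i) \mid x_p \right] = \Phi(z)^n.$$ Averaging over the distribution of $x_p$ then gives the unconditional probability, $$\Pr\left[\bigcap_{i=1}^n (x_p > x_i)\right] = \int_{-\infty}^{\infty} \Phi(z)^n f(z) \, dz = \left[\frac{\Phi(z)^{n+1}}{n+1}\right]_{-\infty}^{\infty} = \frac{1}{n+1},$$ since $f = \Phi'$. For $n = 2$ this is $1/3$, in agreement with the integral above, and it is never $0.5^n$ for $n \ge 2$.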

Another answer

The distribution of heights does not matter for this argument. All that matters is that they are strictly linearly ordered, that is, that there are no ties. The probability that a third person is taller than the first two is not $\frac 14$ but $\frac 13$ by symmetry. Your calculation assumed that the event that the third person is taller than the first and the event that they are taller than the second are independent, but they are not. Generally, if you pick $n$ people, the chance that a specific one is tallest is $\frac 1n$.
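The symmetry argument can be checked by brute force for small $n$. The sketch below uses no distribution at all, only the assumption stated above that every ordering of the heights is equally likely and there are no ties. The function name is made up for this example.

```python
from fractions import Fraction
from itertools import permutations


def chance_specific_is_tallest(n):
    """Exact chance that person 0 is the tallest of n people.

    Each permutation assigns height ranks 0..n-1 to persons 0..n-1,
    and all n! assignments are treated as equally likely.
    """
    orderings = list(permutations(range(n)))
    favourable = sum(1 for ranks in orderings if ranks[0] == n - 1)
    return Fraction(favourable, len(orderings))


for n in range(2, 7):
    print(n, chance_specific_is_tallest(n))
```

For $n = 3$ this counts 2 favourable orderings out of 6, which is the $\frac 13$ quoted above rather than $\frac 14$.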

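We can also ask what your expected rank is if you are the tallest of $n$ people. I have seen that your expected percentile in the whole population is $\frac n{n+1}$, because the $n$ sampled people cut the population into $n+1$ intervals that are all expected to be the same size.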
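For normally distributed heights, the expected percentile can be turned into an expected height numerically. The sketch below assumes the standard result that the largest of $m$ iid draws with CDF $\Phi$ and PDF $f$ has density $m\,\Phi(z)^{m-1} f(z)$, and integrates against it with a plain midpoint rule. The function names, the integration range and the parameters ($\mu = 170$ cm, $\sigma = 10$ cm) are illustrative assumptions. Being taller than $n$ other people corresponds to a group of $m = n + 1$.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expectation_for_tallest(group_size, func, lo=-10.0, hi=10.0, steps=20_000):
    """E[func(Z)] where Z is the largest of group_size iid standard normals.

    Uses the density group_size * Phi(z)^(group_size - 1) * f(z) and a
    midpoint rule on [lo, hi]; the tails beyond that range are negligible.
    """
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        z = lo + (k + 0.5) * width
        density = group_size * normal_cdf(z) ** (group_size - 1) * normal_pdf(z)
        total += func(z) * density * width
    return total


# Illustrative parameters (assumed for this sketch, heights in cm).
MU, SIGMA = 170.0, 10.0

for n in (1, 2, 5, 10):
    group = n + 1  # you plus the n people you are taller than
    percentile = expectation_for_tallest(group, normal_cdf)
    height = MU + SIGMA * expectation_for_tallest(group, lambda z: z)
    print(n, round(percentile, 4), round(height, 2))
```

With a group of $m$ people the expected percentile comes out as $\frac m{m+1}$, matching the interval argument, while the expected height grows much more slowly with $n$ because the normal tail thins out quickly.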