CDF of iid values after removing highest value


Say I have $n$ iid draws from a distribution with continuous cdf $F$ and density $f$ on the interval $[0,\overline x]$.

Then the cdf of the highest of these draws is $(F(x))^n$, and the cdf of the $k$-th highest draw is $G_k(x) = \sum_{i=0}^{k-1} {n \choose i}(F(x))^{n-i} (1-F(x))^i$ (the event that at most $k-1$ of the $n$ draws exceed $x$). So far, so good.
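As a quick sanity check (my own addition, not part of the question), the order-statistic formula can be verified by Monte Carlo. The uniform distribution on $[0,1]$ and the parameters $n=5$, $k=2$, $x=0.7$ are arbitrary choices for illustration.

```python
import random
from math import comb

def G(k, n, Fx):
    """CDF of the k-th highest of n iid draws: at most k-1 draws exceed x."""
    return sum(comb(n, i) * Fx ** (n - i) * (1 - Fx) ** i for i in range(k))

# Monte Carlo check with F uniform on [0, 1], so F(x) = x (assumed example).
random.seed(0)
n, k, x, trials = 5, 2, 0.7, 200_000
hits = 0
for _ in range(trials):
    draws = sorted([random.random() for _ in range(n)], reverse=True)
    hits += draws[k - 1] <= x  # k-th highest is index k-1 after descending sort
empirical = hits / trials
theory = G(k, n, x)  # F(x) = x for the uniform
```

With these parameters, both numbers come out near $0.528$.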

Now, suppose I remove the highest of these $n$ values and announce that this value was $\widehat y$. Then I can update my distribution $F$ for the remaining population conditional on the new information: I now have $(n-1)$ iid draws with cdf $F(x)/F(\widehat y)$ for $x \le \widehat y$. So far, so good.
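This conditional claim can also be checked numerically (my own addition). With $F$ uniform on $[0,1]$, it says that given the maximum $m$, each remaining draw has cdf $x/m$ on $[0,m]$, i.e. the ratio draw$/m$ is uniform on $[0,1]$; the parameters below are assumed for illustration.

```python
import random

# With F uniform on [0, 1] (assumed example), given the maximum m the claim
# says each remaining draw has CDF x/m on [0, m], so draw/m ~ U(0, 1).
random.seed(1)
n, trials, q = 5, 100_000, 0.3
hits = 0
for _ in range(trials):
    draws = [random.random() for _ in range(n)]
    m = max(draws)
    draws.remove(m)
    r = random.choice(draws) / m  # one remaining value, rescaled by the max
    hits += r <= q
frac = hits / trials  # should be close to q, independently of m
```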

Now, suppose I do not announce this value $\widehat y$.

Then, the cdf of a randomly picked value should be (splitting the integral at $y=x$, where the minimum switches) $$\begin{aligned} F_{new}(x) &= \int_0^{\overline x} \min \{ 1, F(x)/F(y) \} \, d (F(y))^n \\ &= \int_0^x 1 \, d F(y)^n + F(x) \int_x^{\overline x} \frac{n F^{n-1}(y) f(y)}{F(y)} \, dy \\ &= F(x)^n + F(x) \frac{n}{n-1} \int_x^{\overline x} (n-1)F(y)^{n-2}f(y) \, dy \\ &= F(x)^n + F(x) \frac{n}{n-1} \left(1- F(x)^{n-1}\right). \end{aligned}$$ This appears reasonable, because $F_{new}(x)=\frac{1}{n-1}\sum_{k=2}^n G_k(x)$.
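A simulation supports the final expression (again my own addition; $F$ uniform on $[0,1]$, $n=5$, $x=0.5$ are assumed choices):

```python
import random

# Monte Carlo check of F_new with F uniform on [0, 1] and n = 5 (assumed):
# F_new(x) = x**n + n/(n-1) * x * (1 - x**(n-1)), since F(x) = x.
random.seed(2)
n, x, trials = 5, 0.5, 200_000
hits = 0
for _ in range(trials):
    draws = [random.random() for _ in range(n)]
    draws.remove(max(draws))           # discard the (unannounced) highest value
    hits += random.choice(draws) <= x  # randomly pick one of the remaining n-1
empirical = hits / trials
theory = x ** n + n / (n - 1) * x * (1 - x ** (n - 1))
```

For these parameters both values are near $0.617$.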

Similarly, I can calculate the cdf of the highest value of the new population, which intuitively should be the cdf of the 2nd-highest value of the initial population. Indeed, $$\begin{aligned} G_{1,new}(x) &= \int_0^{\overline x} \min \{ 1, (F(x)/F(y))^{n-1} \} \, d (F(y))^n \\ &= \int_0^x 1 \, d F(y)^n + F^{n-1}(x) \int_x^{\overline x} \frac{n F^{n-1}(y) f(y)}{F^{n-1}(y)} \, dy \\ &= F(x)^n + n F^{n-1}(x) \int_x^{\overline x} f(y) \, dy \\ &= F(x)^n + n F^{n-1}(x) (1-F(x)) = G_2(x). \end{aligned}$$
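This identity, too, is easy to confirm by simulation (my own addition; the uniform $F$ and the parameters are assumed for illustration):

```python
import random

# Check that the highest remaining value has the CDF of the 2nd-highest of the
# original n draws; F uniform on [0, 1], n = 5, x = 0.5 are assumed choices.
random.seed(3)
n, x, trials = 5, 0.5, 200_000
hits = 0
for _ in range(trials):
    draws = sorted([random.random() for _ in range(n)], reverse=True)
    hits += draws[1] <= x  # max after removing the highest = 2nd highest
empirical = hits / trials
theory = x ** n + n * x ** (n - 1) * (1 - x)  # G_2(x) with F(x) = x
```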

What I now find puzzling: why is $G_{1,new}(x) \neq (F_{new}(x))^{n-1}$?

If my calculations above are correct, it must be that the values of the new population are not iid anymore. But how can this intuitively be the case, given that announcing that all values are below a known threshold $\widehat y$ preserves independence?

Accepted answer:

Your findings can be rephrased as the statement, "the remaining values are conditionally independent given $\hat{y}$," which is not the same as unconditional independence.

To illustrate the difference in a simple case, let's consider independence of events rather than of random variables. Suppose that $X,Y,Z \sim \mathrm{Exp}\left(\lambda\right)$ are i.i.d. Then,

$$P\left(X > z,Y > z\right) = P\left(X > z\right)P\left(Y > z\right)$$

for fixed $z$. However,

\begin{eqnarray*} P\left(X>Z,Y>Z\right) &=& \mathbb{E}\left(P\left(X>Z,Y>Z\mid Z\right)\right)\\ &=& \mathbb{E}\left(P\left(X>Z\mid Z\right)P\left(Y>Z\mid Z\right)\right)\\ &=& \mathbb{E}\left(e^{-2\lambda Z}\right)\\ &=& \frac{1}{3}, \end{eqnarray*}

where the second line uses conditional independence and the last line uses the exponential MGF. This is not the same as

$$P\left(X>Z\right)P\left(Y>Z\right) = \left(\frac{1}{2}\right)^2 = \frac{1}{4}.$$
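The gap between $1/3$ and $1/4$ shows up clearly in a short simulation (my own addition; $\lambda = 1$ is an assumed choice):

```python
import random

# Numerical check of the exponential example with lambda = 1 (assumed):
# P(X > Z and Y > Z) = 1/3, while P(X > Z) * P(Y > Z) = 1/4.
random.seed(4)
trials = 300_000
joint = single = 0
for _ in range(trials):
    X, Y, Z = (random.expovariate(1.0) for _ in range(3))
    joint += (X > Z) and (Y > Z)
    single += X > Z
p_joint = joint / trials    # close to 1/3
p_single = single / trials  # close to 1/2, so the product is close to 1/4
```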

The issue is that, although you can use conditional independence to write a joint probability as a product, you have to take the expected value of the entire product over the distribution of the r.v. on which you conditioned. This need not be equal to the product of the expected values. In general, the conditional distributions of $X$ and $Y$, given some information about $Z$, need not be independent any longer.