90th percentile of a randomly included sample

34 Views Asked by At

I have a set, $S_1$, of $n$ numbers, $a_1 < a_2 < ... < a_n$.

I have a process whereby each number has a $p=50\%$ chance of inclusion in the next set, $S_2$.

My final value is then min($S_2$).

What is the 90th percentile of min($S_2$) as a function of $a_i$ and $p$?

1

There are 1 best solutions below

0
On BEST ANSWER

Assume the inclusion process is independent for each number

$a_i$ is the minimum if and only if $a_1, a_2, \ldots, a_{i-1}$ are not included and $a_i$ is included. Therefore

$$ \Pr\{\min S_2 = a_i\} = (1 - p)^{i-1}p, i = 1, 2, \ldots, n$$

and there is a case where $S_2$ is empty with probability $(1-p)^n$. In such case define the minimum (actually infimum) to be an arbitrary number $a_0$ which is less than $a_1$ and should not affecting the calculation, provided that $p$ is not too small.

The CDF of $\min S_2$ is

$$ \begin{align} &F(x) = \Pr\{\min S_2 \leq x\} \\ &= \begin{cases} 0 & \text{if} & x < a_0 \\ (1 - p)^n & \text{if} & a_0 \leq x < a_1 \\ \displaystyle (1-p)^n + \sum_{j=1}^{i-1} (1-p)^{j-1}p = (1-p)^n + 1 - (1-p)^{i-1} & \text{if} & a_{i-1} \leq x < a_i, i = 2 ,3, \ldots,n \\ 1 & \text{if} & x \geq a_n \end{cases} \end{align}$$

The definition of percentile/quantile for discrete distribution is not unique. https://en.wikipedia.org/wiki/Percentile#Definitions

In general all of them have a similar idea that you search for the ("psuedo"-)inverse of the CDF $F$, the quantile function. As you see there are jumps in the CDF of a discrete distribution, when you want to get the $k$-th percentile by solving

$$ F(x) = \frac {k} {100} $$

there will be no solution for a particular $k$ in general. We can still attempt to solve it for a fake, non-integer $i$, for non-extreme case, to see it jump over $k$-th percentile at which support point:

$$ \begin{align} & & (1 - p)^n + 1 - (1 - p)^{i-1} &= \frac {k} {100} \\ &\Rightarrow & (1 - p)^{i-1} &= (1 - p)^n + 1 - \frac {k} {100} \\ &\Rightarrow & (i - 1)\ln(1 - p) &= \ln\left[(1 - p)^n + 1 - \frac {k} {100}\right]\\ &\Rightarrow & i &= 1 + \frac {1} {\ln(1 - p)} \ln\left[(1 - p)^n + 1 - \frac {k} {100}\right] \end{align}$$

Say in your case, put $k = 90$ and $p = 1/2$, we have

$$ i = 1 - \frac {1} {\ln 2} \ln\left( \frac {1} {2^n} + \frac {1} {10} \right) $$

This is an increasing function of $n$ and for large $n$, $i \approx 4.321928$ (for $n > 25$ this is a very good approximation already).

As the function $(1 - p)^n + 1 - (1 - p)^{i-1}$ is increasing in $i$, we know that when $i = 5$, i.e. $a_4 \leq x < a_5$, $F(x) > 0.9$, whereas when $a_3 \leq x < a_4$, $F(x) < 0.9$.

So for nearest rank, you may round it off to obtain $a_4$ (or $a_5$) as your percentile. Or a linear interpolation of them, like $0.321928 a_5 + 0.678072 a_4$