What's the expected size of sampling for estimating percentile in a dataset?

32 Views Asked by At

Problem definition:

  • D: a size n set of real numbers

  • M: a size m set of real numbers sampled from D without replacement,
    each number in D has equal probability to appear in M.

  • $m \leq n$

I want to estimate percentile $q \in [0,1]$ for n. That is: I want a number x such that there are $qn$ numbers in D less than x. Since n is too large, I will use quantile in M to approximate quantile in D.

I want to estimate the expected value of m, such that I have $1-\delta$ probability of having my estimation y with error $|\hat{q} - q| \leq \epsilon$. Where $\hat{q}$ is the ground truth quantile of y in D.

Some paper say m is at least $\Theta(\frac{1}{\epsilon^2}log(\frac{1}{\delta}))$, can anybody help me to prove that?

Example:

D: [1,2,3,4, ... , 100]

n = 100

M: [2, 70, 90]

m = 3

q = 0.5(this means I want the median of D)

x = 50

y = 70

$\hat{q} = 0.7$

$|\hat{q} - q| = 0.2$