Sample complexity of quantile approximation

146 Views Asked by At

(I am re-posting from cross-validated where this questions did not get any answers)

I was looking into the sample complexity of $\epsilon$-approximate quantiles. The problem is defined as when having distribution and looking for quantile $\phi$, one has to answer with an element which has quantile between $\phi - \epsilon$ and $\phi + \epsilon$.

I have found the paper "Approximate Aggregation for Tracking Quantilesin Wireless Sensor Networks" which claims that "The work in [27] shows that a random sample of size $Θ(1/\epsilon^2 )$ is needed to be drawn from a dataset to compute $\epsilon$-approximate quantiles with a constant probability". However, the referenced paper [27] is the paper "ON THE UNIFORM CONVERGENCE OF RELATIVE FREQUENCIES OF EVENTS TO THEIR PROBABILITIES". Not only does this paper not claim anything about quantile estimation but it does not seem to talk about lower-bounds.

Is that a mistake in the paper I found or am I missing something? Or is there some other way to easily see that the complexity is $\Omega(1/\epsilon^2)$?

1

There are 1 best solutions below

2
On

Take a specific example. Suppose your distribution is $\mathsf{Norm}(\mu=100, \sigma=15).$ You want to take a sufficiently large sample from this distribution to get at least one observation between the 74th and 76th percentiles. That is between quantiles $.75 \pm .01.$

With $\epsilon = .01$ we might have to take as many as $1/\epsilon^2 = 10\,000$ observations.

Specifically, we need to have at least one observation in the interval $(109.6502, 110.5945).$ [Computations and simulations in R.]

q = qnorm(c(.74,.76), 100, 15); q
[1] 109.6502 110.5945

Because the 75th percentile is in a fairly dense part of this distribution, I'll see if I can get by, sampling only $1000.$

set.seed(628)
x = rnorm(1000, 100, 15)
summary(x)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  50.15   90.36   99.69  100.09  110.35  150.90 

Observations matching our specifications are found as follows:

x[x>q[1] & x<q[2]]
 [1] 110.1191 109.9076 110.0786 110.1584 109.8379 109.9022 110.2740
 [8] 110.3990 109.8314 110.4582 110.1659 109.9493 109.8731 109.9490
[15] 110.3925 109.9099 110.1190 110.4647 109.8851 110.3344

From the summary we already know that the upper quartile (75th percentile, quantile 0.75) is 110.35. If 110.35 isn't exactly one of the observations, there are likely to be a couple of observations close by.

If I had been looking for the $98$th percentile $\pm 1,$ where observations are less dense, maybe 1000 observations wouldn't have been enough.

Notes: (a) I'm not exactly sure about some of your notation. I guessed that $\Theta(1/\epsilon^2)$ might have been big-O notation. (b) It may have been a mistake in your posting here and on our sister 'cross-validated' site to mention a reference that you're not sure is really relevant. (c) Anyway, I hope some parts of my Answer are near enough to what you need that you can figure out any remaining details.