What are some approaches to estimating at which percentile a value falls given limited summary stats and unknown distribution?

I know $n$, the mean, and the $10$th, $25$th, $50$th (median), $75$th, and $90$th percentiles of a sample from an unknown distribution.

It is probably not normal: the mean generally exceeds the median, by varying amounts.

What are some approaches for estimating what percentile some arbitrary value falls at given these limited summary stats? Is there a good web-based resource for reading on this?

For example:

  • $n = 80$
  • $10$th $= 31,220$
  • $25$th $= 38,740$
  • $50$th $= 51,580$
  • mean $= 54,700$
  • $75$th $= 67,940$
  • $90$th $= 80,290$
  • ...and I'm curious where $47,000$ would likely fall

I have a large data set with summary stats like this for each item, and $n$ for each item can range from the tens to the tens of thousands. I'm just looking for reasonable (defensible) approaches to estimate the proportion of values that exceed an arbitrary threshold, and my stats are really rusty.
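To make the example concrete, here is a minimal sketch of one simple, distribution-free approach: treat the known quantiles as points on the CDF and interpolate linearly between them. This is an illustrative assumption, not a method taken from the answer below; the function name `estimate_percentile` and the optional log-scale interpolation (sometimes preferred for right-skewed data) are hypothetical choices. Values outside the 10th to 90th percentile range are returned as NaN rather than extrapolated.

```python
import numpy as np

# Known summary quantiles from the example in the question.
probs = np.array([0.10, 0.25, 0.50, 0.75, 0.90])
quantiles = np.array([31220.0, 38740.0, 51580.0, 67940.0, 80290.0])

def estimate_percentile(value, probs=probs, quantiles=quantiles, log_scale=False):
    """Piecewise-linear interpolation of the CDF between known quantiles.

    Returns NaN outside the known quantile range instead of extrapolating.
    """
    q = np.log(quantiles) if log_scale else quantiles
    v = np.log(value) if log_scale else value
    return float(np.interp(v, q, probs, left=np.nan, right=np.nan))

p_lin = estimate_percentile(47000)                  # interpolate on the raw scale
p_log = estimate_percentile(47000, log_scale=True)  # interpolate on the log scale
print(f"linear: {p_lin:.3f}, log-scale: {p_log:.3f}")
print(f"estimated share above 47,000: {1 - p_lin:.3f}")
```

On the raw scale this places $47,000$ at roughly the 41st percentile. The mean is not used here; with only five quantiles, any such estimate is rough, especially in the tails.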

There is 1 best solution below

There is a "Central Limit Theorem" for sample medians as well as for sample means.

Let $X_1, X_2, \dots, X_n$ be a random sample from a distribution with density $f(x)$ and population median $\eta,$ and let $H_n = X_{(k)}$ denote the $k$th order statistic, used as the sample median. If $k/n \rightarrow 0.5$ (with $k - n/2$ bounded), then the sequence of medians $H_n$ is asymptotically normal with mean $\eta$ and variance $c^2/n,$ where $c^2 = \frac{1}{4[f(\eta)]^2}.$

Notice that the key condition is that $f(\eta) > 0.$ Compare this with the asymptotic normality of $\bar X,$ with mean $\mu$ and variance $\sigma^2/n,$ which requires finite $\sigma.$
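As a quick sanity check of the median result, the following sketch simulates sample medians from a standard normal population and compares $n \cdot \mathrm{Var}(H_n)$ with the standard asymptotic value $c^2 = 1/(4[f(\eta)]^2),$ which equals $\pi/2$ for the standard normal. The choice of distribution, the sample size $n = 401,$ the number of replications, and the function name `scaled_median_variance` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def scaled_median_variance(n, reps=20000):
    """Estimate n * Var(sample median) from `reps` standard-normal samples of size n."""
    samples = rng.standard_normal((reps, n))
    medians = np.median(samples, axis=1)
    return n * medians.var()

f_eta = 1.0 / np.sqrt(2.0 * np.pi)   # standard normal density at its median, 0
c2_theory = 1.0 / (4.0 * f_eta**2)   # asymptotic value, equal to pi/2

c2_sim = scaled_median_variance(n=401)
print(f"simulated n*Var(median): {c2_sim:.3f}, asymptotic c^2: {c2_theory:.3f}")
```

The two numbers should agree to within simulation error and a small finite-sample effect.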

A similar theorem holds for any quantile $\theta_p$ with $0 < p < 1,$ that is, excluding the 'extreme values' (the minimum, $p = 0,$ and the maximum, $p = 1$). Then $c^2 = \frac{p(1-p)}{[f(\theta_p)]^2},$ which again requires $f(\theta_p) > 0.$ [One reference is Bain and Engelhardt: Intro. to Probability and Math. Statistics, Sec 7.6. I'm sure you can find others by looking for 'order statistics' in various mathematical statistics texts.]
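The quantile result gives an approximate standard error $\sqrt{p(1-p)/n}\,/\,f(\theta_p)$ for a sample quantile. A rough sketch of how that might be applied to the question's example follows. The density $f$ is unknown, so as a loudly labeled assumption it is approximated here by the slope of the CDF between the two quartiles; the function name `quantile_std_error` is hypothetical.

```python
import math

def quantile_std_error(p, n, density):
    """Asymptotic standard error of a sample quantile: sqrt(p(1-p)/n) / f(theta_p)."""
    return math.sqrt(p * (1.0 - p) / n) / density

# ASSUMPTION: crude density estimate at the median, taken as the CDF slope
# between the 25th and 75th percentiles from the question's example.
f_hat = (0.75 - 0.25) / (67940.0 - 38740.0)

se_median = quantile_std_error(p=0.5, n=80, density=f_hat)
print(f"approximate SE of the sample median: {se_median:,.0f}")
```

With $n = 80$ this suggests an uncertainty of a few thousand in the reported median, which is worth keeping in mind before reading much into small differences between items.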

Accordingly, you would not need an extensive summary of data from a sample, only the appropriate sample quantile from a suitably large sample.

Note: This is not to say that looking at sample quantiles is always the best way to estimate population quantiles. For example, consider the median $\eta = 1/2$ of the distribution $\mathsf{Unif}(0,1),$ which has $f(\eta) = 1.$ The sample median $H$ of a large sample will be close to $\eta = 1/2,$ with asymptotic variance $\frac{1}{4n}.$ However, the sample mean $\bar X$ has the smaller variance $\frac{1}{12n}.$ So $\bar X$ will tend to be closer to $\mu = \eta = 1/2$ than will the sample median. (Better yet, if the model is $\mathsf{Unif}(0,\theta)$ with median $\theta/2,$ use $\frac{n+1}{2n}X_{(n)},$ where $X_{(n)}$ denotes the sample maximum.)
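The comparison in this note can be checked with a small simulation. The sketch below draws repeated samples from $\mathsf{Unif}(0,1)$ and compares the mean squared error of three estimators of $1/2$: the sample median, the sample mean, and the scaled maximum $\frac{n+1}{2n}X_{(n)}.$ The sample size, the number of replications, and the seed are arbitrary illustrative choices, and the scaled maximum is only sensible under the assumption that the data really are uniform on $(0,\theta).$

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 20000
x = rng.uniform(0.0, 1.0, size=(reps, n))   # each row is one sample of size n

target = 0.5   # population median (and mean) of Unif(0, 1)
estimates = {
    "median": np.median(x, axis=1),
    "mean": x.mean(axis=1),
    "scaled max": (n + 1) / (2 * n) * x.max(axis=1),
}

# Mean squared error of each estimator around the true value 1/2.
mse = {name: float(np.mean((est - target) ** 2)) for name, est in estimates.items()}
for name, value in mse.items():
    print(f"{name:>10}: MSE = {value:.2e}")
```

The ordering of the errors (scaled maximum smallest, then mean, then median) reflects how much each estimator exploits the assumed uniform shape; none of this carries over to a distribution of unknown form.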