Finding percentiles from given data


I'm trying to improve my understanding of percentiles by coming up with data of my own and analyzing it (as opposed to just reading that "the x-th percentile means you performed better than x% of people"). I am using the same logic as the solution posted in the question asked here: Explanation of this percentile GRE problem.

So suppose 200 people are competing for a record in a game. 1 person achieves the record in 52 seconds, 120 people achieve it in 53 seconds, and 79 people achieve it in 54 seconds (the time is rounded to the nearest second, so, for example, each 53 would be considered equal).

Using the same logic as in the linked question, there would be 200 / 100 = 2 results per percentile. But this does not make sense in the context of my example, because only one person has the fastest time of 52. Also, even when I do divide the results into groups (like so), there is still one 54 that isn't in any group. If I can't apply the logic from the previous question to a new one, I am no longer sure I understand it.

How exactly would one mathematically find out percentiles in this case?

Best answer

With profoundly many ties. There are about 10 slightly different definitions of 'quantile' in common use among various reputable authors and computer programs (R's `quantile` function alone implements nine of them). For your example, which has extremely many ties, I believe they all give the same answers. Here are computations of the median (50th percentile, quantile .5) and the first decile (10th percentile, quantile .1) in R statistical software:

x = c(52, rep(53,120), rep(54,79))
median(x)
## 53
quantile(x, .5)  # median again
50% 
53 
quantile(x, .1)
10% 
53
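
As a quick check of the claim that the common definitions agree here, the sketch below loops over the nine `type` options that R's `quantile` function accepts. The helper name `all_types` is made up for this illustration and is not part of the original computation.

```r
# Completion times from the question: one 52, 120 copies of 53, 79 copies of 54
x <- c(52, rep(53, 120), rep(54, 79))

# Evaluate one quantile under each of the nine definitions quantile() implements
# (all_types is a hypothetical helper written for this example)
all_types <- function(v, p) {
  sapply(1:9, function(t) unname(quantile(v, probs = p, type = t)))
}

all_types(x, 0.5)   # median under types 1 to 9
all_types(x, 0.1)   # first decile under types 1 to 9
```

With this many values tied at 53, every type should return 53 for both quantiles, because the order statistics near positions 20 and 100 are all equal to 53.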

Very roughly speaking, 53 is the median because fewer than half of the 200 observations (fewer than 100) are below 53 (actually 1) and fewer than half are above 53 (actually 79).

Similarly, 53 is the first decile because fewer than 10% of the observations (fewer than 20) are below 53 and fewer than 90% (fewer than 180) are above 53.
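
The counting argument above can be sketched in code by tabulating the cumulative proportion of observations at or below each distinct time. This uses one of the simplest quantile definitions (the inverse of the empirical distribution function, which R documents as `type = 1`). The helper name `smallest_reaching` is an assumption made for illustration.

```r
x <- c(52, rep(53, 120), rep(54, 79))

counts   <- table(x)                      # 52: 1, 53: 120, 54: 79
cum_prop <- cumsum(counts) / length(x)    # proportion at or below each value
cum_prop                                  # 52: 0.005, 53: 0.605, 54: 1

# Smallest observed value whose cumulative proportion reaches p
# (hypothetical helper, written for this example only)
smallest_reaching <- function(p) {
  as.numeric(names(cum_prop)[which(cum_prop >= p)[1]])
}

smallest_reaching(0.1)   # first decile
smallest_reaching(0.5)   # median
```

The single time of 52 accounts for only the lowest 0.5% of the results, so every percentile from the 1st through the 60th lands on 53 under this rule. That is why the idea of "2 results per percentile" does not produce distinct values when the data are this heavily tied.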

With fewer ties. A more intricate example would be one in which completion times were recorded to the nearest hundredth of a second. For your data the sample mean is $\bar X = 53.39$ and the sample standard deviation is $S_x = .499145.$

mean(x)
## 53.39
sd(x)
## 0.499145
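
For reference, both values can be checked by hand from the grouped counts. This worked arithmetic is added for illustration and assumes the usual $n - 1$ denominator, which is what R's `sd` uses:

$$\bar X = \frac{1(52) + 120(53) + 79(54)}{200} = \frac{10678}{200} = 53.39,$$

$$S_x^2 = \frac{1(52 - 53.39)^2 + 120(53 - 53.39)^2 + 79(54 - 53.39)^2}{199} = \frac{49.58}{199} \approx 0.24915, \qquad S_x \approx 0.499145.$$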

Suppose I simulate 200 completion times from a normal distribution with population mean $\mu = \bar X$ and population SD $\sigma = S_x,$ rounding the results to two decimal places. In my run there are fewer ties (exactly 83), and the median (53.415) and first decile (52.738) are different, according to the specific quantile rules used by R. (No seed was set, so another run will give somewhat different numbers. The exact value for the first decile may also differ slightly among statistical software packages: the 20th and 21st order statistics are 52.72 and 52.74, and any number in $[52.72, 52.74]$ might be called the first decile.)

 y = round(rnorm(200, mean(x), sd(x)), 2)
 head(y); tail(y)
 ## 53.75 53.73 53.89 53.12 53.81 53.55  # first 6 observations (not sorted)
 ## 54.21 52.68 53.10 53.68 54.30 52.85  # last 6 observations (not sorted)
 ties = 200 - length(unique(y));  ties
 ## 83                                   # number of ties
 median(y)
 ## 53.415
 quantile(y, .1)
    10% 
 52.738 
 sort(y)[20:21]
 ## 52.72 52.74
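
The value 52.738 can be reproduced by hand. R's default rule (documented as `type = 7`) places the quantile at position $h = (n-1)p + 1$ among the sorted values and interpolates linearly between the neighbouring order statistics. Here $h = 199(0.1) + 1 = 20.9,$ so the first decile is $52.72 + 0.9(52.74 - 52.72) = 52.738.$ A minimal sketch of that rule follows. The function name `type7_quantile`, the variable `y2`, and the seed are my own choices for illustration, and the freshly simulated values will not match the ones shown above.

```r
# Hand-rolled version of R's default (type 7) quantile rule, for illustration
type7_quantile <- function(v, p) {
  s  <- sort(v)
  h  <- (length(s) - 1) * p + 1        # fractional position among sorted values
  lo <- floor(h)
  hi <- ceiling(h)
  s[lo] + (h - lo) * (s[hi] - s[lo])   # linear interpolation between neighbours
}

# The two order statistics reported above reproduce the first decile
52.72 + 0.9 * (52.74 - 52.72)          # 52.738

# Compare with quantile() on newly simulated data (arbitrary seed)
set.seed(2024)
y2 <- round(rnorm(200, 53.39, 0.499145), 2)
all.equal(type7_quantile(y2, 0.1), unname(quantile(y2, 0.1)))
```

Other packages may use a different position formula, which is why their first decile can fall elsewhere in $[52.72, 52.74].$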

Below is a histogram of my fake completion times. The tick marks beneath the histogram show the $117 = 200 - 83$ unique observations (ties overprinted). The positions of the median (red vertical line) and the first decile (green vertical line) are also indicated.

[Figure: histogram of the simulated completion times, with the median marked in red and the first decile in green]