Why is finding the median and quartiles different on a cumulative frequency graph? (Or is it?)

634 Views Asked by At

When finding the median we look for the (n+1)/2 value in the ordered data set. For the lower quartile it's the (n+1)/4 value and the upper quartile is 3/4(n+1). However, I've been helping my girlfriend with some homework where she has to find these statistics by looking at a cumulative frequency graph. Here, n=40 and she has been told to read along from 20 on the y-axis for the media, and 10 and 30 for the quartiles. In my mind, we should read along from 20.5, 10.25, and 30.75 respectively, but looking online it seems that the way she has been taught is fairly standard. Am I missing something about the way this works with the graph?

1

There are 1 best solutions below

0
On BEST ANSWER

The usual definition of the 25th quartile is a value $q$ such that not more than 25% of observations are below $q$ and nor more than 75% are above $q.$

Depending on the number of observations and on the particular observations about a quarter of the way along the sorted data, that usual definition may result in an interval of possible 25th quartiles.

various texts and software programs have adopted different rules as to which value in such an interval to select. About ten rules are in common use, with no signs of a consensus on the horizon.

R statistical software calls the different rules 'types'; type=7 is the default, but the user can supply an argument to choose any of the others.

Suppose you have 40 observations from $\mathsf{Norm}(\mu=100,\sigma=15).$ The population lower and upper quartiles are about 89.9 and 110,1, as shown below.

qnorm(c(.25,.75), 100, 15)
[1]  89.88265 110.11735

Quartiles of a particular sample of size $n = 40$ from this distribution (rounded to two places) will be "near" these population quartiles, but results will differ as to type.

set.seed(2022)
x = round(rnorm(40, 100, 15), 2);  sort(x)

 [1]  56.49  76.62  78.33  82.40  84.11  85.27  85.74  86.54  87.11  87.74
[11]  90.19  92.23  94.78  95.03  96.42  96.69  97.22  97.47  98.80  99.21
[21] 101.39 101.77 102.13 102.35 103.62 104.17 104.92 105.47 105.75 109.75
[31] 111.24 112.47 112.89 113.50 115.09 115.29 116.15 116.19 116.70 118.17

Here are some results (look under 25% for lower quantiles). These are various ways to put a number into the gap between the end of the first line $87.74$ and the beginning of the second line $90.19$ above.

quantile(x)  # default 'type 7'
      0%      25%      50%      75%     100% 
 56.4900  89.5775 100.3000 110.1225 118.1700 
quantile(x, type=1)
    0%    25%    50%    75%   100% 
 56.49  87.74  99.21 109.75 118.17 
quantile(x, type=2)
     0%     25%     50%     75%    100% 
 56.490  88.965 100.300 110.495 118.170 
quantile(x, type=3)
    0%    25%    50%    75%   100% 
 56.49  87.74  99.21 109.75 118.17 
quantile(x, type=4)
    0%    25%    50%    75%   100% 
 56.49  87.74  99.21 109.75 118.17 
quantile(x, type=5)
     0%     25%     50%     75%    100% 
 56.490  88.965 100.300 110.495 118.170 
 quantile(x, type=6)
      0%      25%      50%      75%     100% 
 56.4900  88.3525 100.3000 110.8675 118.1700 

Below is an empirical CDF plot (ECDF) of the sample. The horizontal red line is at .25 on the vertical asis. It intersects the ECDF at a flat spot, corresponding to various choices of the 25th percentile. (lower quartile).

plot(ecdf(x))
 abline(h=.25, col="red")

enter image description here

For your girlfriend's class, it is best to use whatever rule the text or her class notes specifies. Otherwise, you are free to choose any 'type' of 25th percentile you like.

In practice quartiles and other percentiles are most commonly used for large samples, where differences in 'types' often make no practical difference.