Why don't we use the mean of the middle pair to estimate the median in a cumulative frequency curve?


I just learned about the cumulative frequency curve. The book says I can use this curve to estimate the median of the data.

[Figure: cumulative frequency curve from the book]

This is the picture that I cut from my book. As you can see, to estimate the median, the book finds the x-coordinate of the point whose y-coordinate (i.e. the cumulative frequency) is 150 (the total frequency is 300). My question is: why don't we use the mean of the 150th and the 151st data values to estimate the median? Why do we use only the 150th value instead? This is really strange!

Also, what should we do to estimate the median when the total cumulative frequency is odd?


There are 3 best solutions below

Answer 1

The general definition of the median is the following:

$$me=F^{-1}(0.5)$$

That is, the value of "Marks" corresponding to 50% of the frequencies.

EDIT: answer to your comment.

Assume now that we have 10 equiprobable discrete values. The cumulative frequency function is the following (I did not draw the function for the last 3 values because they are not important to this reasoning):

[Figure: step plot of the cumulative frequency function]

As you can see, the X values corresponding to 50% of the frequencies are the values between the 5th and the 6th value. That is why the median is taken as the "middle value" between the 5th and the 6th.
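The reasoning above can be sketched in a few lines of Python: with an even number of sorted values, any point between the two middle values satisfies $F^{-1}(0.5)$, and the usual convention averages them (a minimal illustration, not the book's method; it also covers the odd-$n$ case asked about in the question):

```python
def median(values):
    """Median by the usual convention: middle value for odd n,
    midpoint of the two middle values for even n."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                      # odd n: unique middle value
    return (s[mid - 1] + s[mid]) / 2       # even n: average of the middle pair

# 10 equiprobable values, as in the answer above:
print(median(range(1, 11)))  # midpoint of 5 and 6 -> 5.5
```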

Answer 2

When the sample size $n$ is even, any value between the middle two of the (sorted) data qualifies as a median. Various textbooks and statistical software programs use different conventions: some use the lower of the two numbers, some use the upper, and some use the average of the two.

Similarly, various books and programs have different conventions for other quantiles (including the lower and upper quartiles). For large datasets there is usually no practical difference among these conventions.

Overall, there are about ten different methods of resolving these cases where there is no one right answer.

Consider the data (1,2,2,2,3,5,6,7,9,11). Here are results for a few different 'types' of quantiles for these data in R (where type=7 is used unless you make another choice). For such small datasets it is easy to see the differences among types. (Proponents of each quantile type give reasons, sometimes elaborate technical ones, why their type is "best".)

x = c(1,2,2,2,3, 5,6,7,9,11)
quantile(x, type=1)
  0%  25%  50%  75% 100% 
   1    2    3    7   11 
quantile(x, type=2)
  0%  25%  50%  75% 100% 
   1    2    4    7   11     
quantile(x, type=4)
  0%  25%  50%  75% 100% 
 1.0  2.0  3.0  6.5 11.0 
quantile(x, type=6)
  0%  25%  50%  75% 100% 
 1.0  2.0  4.0  7.5 11.0 
quantile(x)  # default type 7
   0%   25%   50%   75%  100% 
 1.00  2.00  4.00  6.75 11.00 

But for a large sample of $n=1000$ observations from a normal distribution, differences among 'types' are usually relatively unimportant. Here are a few examples:

set.seed(2021)
y = rnorm(1000, 100, 15)
quantile(y, type=3)
       0%       25%       50%       75%      100% 
 51.97282  89.29286 100.07303 110.89819 152.34496 
quantile(y, type=6)
       0%       25%       50%       75%      100% 
 51.97282  89.31540 100.07334 110.90887 152.34496 
quantile(y, type=7)
       0%       25%       50%       75%      100% 
 51.97282  89.36047 100.07334 110.90175 152.34496 

If your textbook, instructor, boss, or project manager has a favorite quantile type, then it is probably best to use that type so everyone in the group gets consistent answers. If you are working on your own project, then you might pick your own favorite (or use the default from your software).
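For readers working in Python rather than R, NumPy's `np.quantile` exposes the same conventions through its `method` argument (NumPy 1.22 or later): `"inverted_cdf"` corresponds to R's type=1, `"averaged_inverted_cdf"` to type=2, `"weibull"` to type=6, and the default `"linear"` to type=7. A quick check on the same small dataset:

```python
import numpy as np

x = np.array([1, 2, 2, 2, 3, 5, 6, 7, 9, 11])
q = [0, 0.25, 0.5, 0.75, 1]

# NumPy's `method` argument mirrors R's quantile types:
print(np.quantile(x, q, method="inverted_cdf"))           # R type=1: median 3
print(np.quantile(x, q, method="averaged_inverted_cdf"))  # R type=2: median 4
print(np.quantile(x, q, method="weibull"))                # R type=6: upper quartile 7.5
print(np.quantile(x, q))                                  # default "linear" = R type=7
```

The printed values match the R output above for the corresponding types.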

Answer 3

To make it simple:

If you are estimating, say from a cumulative frequency curve or table, you are already losing some accuracy, so adding the 1 (reading off at $(n+1)/2$ rather than $n/2$) hardly changes your answer. It is usually easier just to halve the total, and since you are only estimating, doing the easy thing matters more than this tiny correction.
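This point can be checked numerically. The sketch below linearly interpolates a cumulative frequency curve at $n/2$ and at $(n+1)/2$; the class boundaries and frequencies are made up for illustration (the book's actual table is not reproduced in the question), but with $n = 300$ the two read-off points are 150 and 150.5, and the resulting estimates are nearly identical:

```python
import numpy as np

# Hypothetical grouped data (NOT the book's table): upper class
# boundaries and the cumulative frequency at each boundary, n = 300.
marks = [0, 20, 40, 60, 80, 100]
cum_freq = [0, 30, 110, 220, 280, 300]

n = cum_freq[-1]
# Read the curve off at n/2 = 150 and at (n+1)/2 = 150.5:
est_half = np.interp(n / 2, cum_freq, marks)
est_half_plus = np.interp((n + 1) / 2, cum_freq, marks)

print(est_half, est_half_plus)  # approx. 47.27 vs 47.36: a negligible gap
```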