Which of the two methods I am mentioning here is correct for calculating the Mode of a grouped data?


The question is as follows: -

Question for calculation of Mean, Median and Mode

Direct calculation of mode:-

Direct calculation of mode

Calculation of mode using mean and median and the relation between mean, median and mode:-

Calculation of mode through other method

I am getting two different answers through two methods as shown in the pictures attached above. Which one is correct?

There are 1 best solutions below

Finding a mode of data can be difficult, especially if it's grouped data. The choice of group boundaries can make a big difference in the answer you get. As a practicing statistician, I have to say the problems about finding modes of small samples from grouped data are far more prevalent in elementary statistics classes taught by mathematicians than in real-life applications.

The first thing you need to look for in order to make sense of finding the mode for data is the definition. Can there be only one mode according to the definition or can there be several modes? (Sometimes one speaks of 'multi-modal' samples.)

Especially when sample sizes are small, the mode of grouped data, or data in a histogram, can depend greatly on what group boundaries are chosen.
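As a small illustration of that dependence, here is a sketch with a made-up toy sample (not the data from the question). The name `x_toy` and the helper `modal_interval` are hypothetical. Both groupings use class width 10; only the starting point differs.

```r
# Toy data, invented for illustration only
x_toy = c(2, 8, 9, 11, 12, 13, 19, 21, 22, 28)

# Same class width (10), two different starting points
h_a = hist(x_toy, breaks = seq(0, 30, by = 10), plot = FALSE)
h_b = hist(x_toy, breaks = seq(-5, 35, by = 10), plot = FALSE)

# Lower and upper edge of the class with the largest count
modal_interval = function(h) {
  k = which.max(h$counts)          # first class with the maximum count
  c(lower = h$breaks[k], upper = h$breaks[k + 1])
}

h_a$counts            # 3 4 3
modal_interval(h_a)   # modal class (10, 20]
h_b$counts            # 1 5 3 1
modal_interval(h_b)   # modal class (5, 15]
```

Shifting the boundaries by half a class width moves the modal class, so any formula that interpolates inside the modal class moves with it.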

While you are a student in a beginning statistics class, you need to focus on the textbook definition of sample mode, and figure out how to use the formulas in your text or class notes. Getting alternative definitions and formulas from various sources is not the most direct route to success in the class or to understanding statistical principles. If you do serious applied statistics later on, then there may be better methods for you to use.

Small samples. Consider the following sample of size $n = 30$ from a normal distribution with mean $\mu=100$ and standard deviation $\sigma=15.$ Some numerical descriptive statistics made directly from the data are shown below.

set.seed(430)
x = rnorm(30, 100, 15)
summary(x); length(x); sd(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  81.28   90.82   98.05  101.37  113.11  135.43 
[1] 30          # sample size
[1] 13.88452    # sample standard deviation
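
The second method in the question presumably rests on the empirical relation Mode ≈ 3 Median − 2 Mean. That relation is a rule of thumb for moderately skewed unimodal data, not an identity, so there is no reason for it to agree exactly with a formula based on class frequencies. A rough sketch, applying it to the median and mean printed above (the names `med` and `avg` are just for illustration):

```r
# Values copied from the summary output above
med = 98.05
avg = 101.37

# Empirical relation: mode is roughly 3 * median - 2 * mean
3 * med - 2 * avg    # 91.41
```

For this sample the rule of thumb lands near 91.4, well below the population mode of 100.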

Below we have two histograms. These are density histograms because the widths and heights are scaled so that the total area of all the bars in a histogram sums to $1.$ Tick marks on the horizontal axis show the exact values of the thirty observations. The orange curves show the normal population density curve.
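
The area claim can be checked directly. This is a minimal sketch that assumes the sample is regenerated with the same seed as above; `hist(..., plot = FALSE)` returns the bar heights in `density` and the bin edges in `breaks`.

```r
set.seed(430)
x = rnorm(30, 100, 15)

h = hist(x, plot = FALSE)          # compute the bins without drawing
sum(h$density * diff(h$breaks))    # total bar area: height times width, summed
```

The sum should equal 1 up to floating-point rounding.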

At left the modal interval is $(90,100].$ The usual formulas for getting the mode from grouped data would put 'the mode' somewhere inside that interval. Some texts might say a second (or minor) modal interval is $(110,120].$ If the purpose of the sample mode is to estimate the mode of the distribution, then neither one will give a very good estimate; the mode of the normal distribution is at $100.$

At right is a histogram with one, two, or three modal intervals, depending on which book you read. Formulas might put 'the mode' at about 95.
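
For concreteness, here is a sketch of one common textbook interpolation formula for the mode of grouped data. It is not necessarily the one used in the question's pictures or in your text. The helper name `grouped_mode` is hypothetical, and a missing neighbouring class is assumed to have count 0.

```r
# Hypothetical helper: interpolated mode of grouped data,
#   mode = L + (f1 - f0) / (2*f1 - f0 - f2) * w
# L = lower edge of the modal class, w = its width,
# f1 = its count, f0 and f2 = counts of the classes on either side.
grouped_mode = function(x, breaks = "Sturges") {
  h  = hist(x, breaks = breaks, plot = FALSE)
  f  = h$counts
  k  = which.max(f)                          # first modal class if there are ties
  f0 = if (k > 1) f[k - 1] else 0            # no class to the left: count 0
  f2 = if (k < length(f)) f[k + 1] else 0    # no class to the right: count 0
  w  = h$breaks[k + 1] - h$breaks[k]
  h$breaks[k] + (f[k] - f0) / (2 * f[k] - f0 - f2) * w
}

set.seed(430)
x = rnorm(30, 100, 15)
grouped_mode(x)                # default bins
grouped_mode(x, breaks = 15)   # about 15 bins (R treats the number as a suggestion)
```

The two calls will generally return different values for the same thirty observations, which is the point of this section: with small samples the answer depends on the binning.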

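[Figure: density histograms of the $n = 30$ sample with default bins (left) and about 15 bins (right); normal density curve in orange]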

R code for figure:

par(mfrow=c(1,2))
hist(x, prob=T, col="skyblue2");  rug(x)
 curve(dnorm(x, 100, 15), add=T, col="orange", lwd=2)
hist(x, prob=T, br=15, col="skyblue2"); rug(x)
 curve(dnorm(x, 100, 15), add=T, col="orange", lwd=2)
par(mfrow=c(1,1))

The histograms in the figures below are the same as above. Here the curves are 'kernel density estimators' (KDEs) based on the data. KDEs attempt to guess what the population density curve might look like. With only thirty observations KDEs typically aren't very good. Here they would estimate the population mode at about 92.

[Figure: the same two histograms of the $n = 30$ sample, with the KDE as a dotted maroon curve]

par(mfrow=c(1,2))
hist(x, prob=T, col="skyblue2");  rug(x)
 lines(density(x), col="maroon", lwd=2, lty="dotted")
hist(x, prob=T, br=15, col="skyblue2"); rug(x)
 lines(density(x), col="maroon", lwd=2, lty="dotted")
par(mfrow=c(1,1))

Larger samples. By contrast, with $n = 300$ observations instead of $30$ we have enough data to make sense of the concept of the mode of a sample. There is one modal interval, and formulas based on a histogram will estimate the mode at a little above $100.$ Also, the KDE estimates the population mode almost exactly at $100$ (even if $n=300$ isn't a large enough sample size for KDEs to have a nice normal shape).

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  59.49   88.55   99.51   99.14  108.33  140.00 
[1] 300
[1] 15.56456

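[Figure: density histograms of the $n = 300$ sample, with the normal density curve in orange and the KDE as a dotted maroon curve]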

par(mfrow=c(1,2))
hist(x, prob=T, col="skyblue2")
 curve(dnorm(x, 100, 15), add=T, col="orange", lwd=2)
  lines(density(x), col="maroon", lwd=2, lty="dotted")
hist(x, prob=T, br=15, col="skyblue2")
 curve(dnorm(x, 100, 15), add=T, col="orange", lwd=2)
  lines(density(x), col="maroon", lwd=2, lty="dotted")
par(mfrow=c(1,1))

Large sample. With $n = 3000$ there is little doubt about the sample mode. It is a good estimate of the population mode (the point where the density function achieves its maximum).

set.seed(1234)
x = rnorm(3000, 100, 15)
summary(x); length(x); sd(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  49.06   90.02  100.18  100.08  110.05  147.94 
[1] 3000
[1] 14.95474

[Figure for the $n = 3000$ sample]

Note: In R, the KDE returned by density() is evaluated at 512 $(x,y)$ points by default. One can find the mode of the KDE as follows:

X = density(x)$x
Y = density(x)$y
mean(X[Y==max(Y)])
[1] 100.4143
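
An equivalent sketch that evaluates the KDE only once and uses `which.max` to pick the grid point with the highest estimated density. It assumes the same seed and sample as above; with ties, `which.max` returns the first maximum rather than the average of the tied points.

```r
set.seed(1234)
x = rnorm(3000, 100, 15)

d = density(x)          # evaluate the KDE once; d$x and d$y hold the grid
length(d$x)             # number of grid points (512 by default)
d$x[which.max(d$y)]     # grid point where the estimated density is highest
```

Because the result is restricted to the grid, it is only as precise as the grid spacing, and it also depends on the bandwidth that `density` chooses.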