Intuition behind unequal class intervals histogram

Consider a histogram with unequal class intervals, for example for the data:

Number of Fruits | Frequency
----------------------------
    1-2          |     5
    2-3          |     6
    3-5          |    10
    5-10         |     9
    10-24        |    12

Why does it make sense to find the frequency density to correct for the unequal class intervals? What is the intuition behind the logic?

I understand the way to find the density is:

$$\frac{\text{Frequency}}{\text{Size of Class Interval}}\times\text{Lowest Class Size}$$

How would someone go about reading such a histogram?


Here is data simulated in R that matches your frequency table. A fundamental principle of a histogram is that each observation be represented by the same basic unit of area.

The histogram shown below is a density histogram, in which bar heights are chosen so that the total area of the bars is $1.$ (The tick marks on the horizontal axis show the exact locations of my simulated points.)

set.seed(813)
x = c(runif(5, 1,2), runif(6, 2,3), runif(10, 3,5),        
      runif(9, 5,10), runif(12, 10,24))
ends = c(1, 2, 3, 5, 10, 24)
hist(x, br=ends, col="skyblue2"); rug(x)

[Figure: density histogram of x with breaks at 1, 2, 3, 5, 10, 24 and rug marks for the individual observations]

In R, a histogram that is computed but not plotted (plot=F) provides information about how the bars are drawn, including bar heights.

hist(x, br=ends, plot=F)
$breaks
[1]  1  2  3  5 10 24

$counts
[1]  5  6 10  9 12

$density
[1] 0.11904762 0.14285714 0.11904762 0.04285714 0.02040816

$mids
[1]  1.5  2.5  4.0  7.5 17.0

w = diff(hist(x, br=ends, plot=F)$breaks)  # widths
h = hist(x, br=ends, plot=F)$density       # heights
a = w*h;  a                                # areas
[1] 0.1190476 0.1428571 0.2380952 0.2142857 0.2857143
sum(a)
[1] 1                                # total area = 1

Depending on your interests, you might want to see if you can figure out how 'densities' are derived from 'frequencies'. I am not familiar with the term 'frequency density'.
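As a minimal sketch of that derivation, assuming 'frequency density' means frequency divided by class width as in the formula in the question: each density height is the count divided by (total count times class width), so that height times width equals the proportion of observations in the class. The names `counts`, `dens` and `fd` are introduced here for illustration only.

```r
# Counts and class boundaries taken from the table in the question
counts <- c(5, 6, 10, 9, 12)
ends   <- c(1, 2, 3, 5, 10, 24)
w      <- diff(ends)            # class widths: 1 1 2 5 14
n      <- sum(counts)           # 42 observations in all

# Density heights: bar area (height * width) equals the class proportion
dens <- counts / (n * w)

# Formula from the question (assumed meaning of 'frequency density'):
# frequency / class width * smallest class width
fd <- counts / w * min(w)

round(dens, 4)                  # compare with $density printed above
sum(dens * w)                   # total area of the density bars
fd / dens                       # the same constant for every class
```

Under this assumption the two scales differ only by a constant factor (here n times the smallest width), so both give bars of the same shape. Only the labelling of the vertical axis changes.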

Ordinarily, histograms based on varying interval widths are discouraged in statistical practice because many people find them difficult to interpret.

By contrast, here is a 'frequency' histogram of my version of your data. It uses intervals of equal width. That makes it possible to show a vertical Frequency scale, and even to label each bar with the number of observations it represents. I have added dotted horizontal lines to show 42 equal 'blocks of area', one for each observation. (One would not necessarily show labels atop bars or horizontal lines in a histogram for publication.)

ends.2 = seq(0, 24, by=2)   # equally spaced
hist(x, br=ends.2, col="skyblue4", ylim=c(0,13), label=T, 
      main="Frequency Histogram")
abline(h = 0:11, col="green", lty="dotted")

[Figure: frequency histogram of x with equal bin widths of 2, counts labelled atop the bars, and dotted horizontal lines]

Here is a histogram of my version of the data that I believe to be more easily readable than the one suggested in your exercise. [Using rug to make tick marks for individual observations works best if there are fewer than about 100 observations.]

ends.3 = seq(0,25, by=5)
hist(x, br=ends.3, col="skyblue2", ylim=c(0,17),
     main="Frequency Histogram")
rug(x)

[Figure: frequency histogram of x with equal bin widths of 5 and rug marks for the individual observations]

Note: The author of the problem may have had other learning objectives in mind, but I hope you will remember: (1) A fundamental principle of a histogram is that each observation is represented by the same amount of area. (2) For presentation to non-statisticians, it is seldom necessary or desirable to make a histogram with unequal bin widths.
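To make point (1) concrete, here is a rough numerical sketch (not part of the original exercise) of what happens if raw frequencies are used as bar heights with the unequal classes from the question. The variable names are illustrative only.

```r
# Counts and class widths from the table in the question
counts <- c(5, 6, 10, 9, 12)
w      <- diff(c(1, 2, 3, 5, 10, 24))

# Share of the observations that falls in each class
share_data <- counts / sum(counts)

# Bar areas if raw counts were used as heights on unequal classes
area_raw       <- counts * w
share_area_raw <- area_raw / sum(area_raw)

# Bar areas if heights are counts divided by width
area_dens       <- (counts / w) * w
share_area_dens <- area_dens / sum(area_dens)

round(rbind(share_data, share_area_raw, share_area_dens), 3)
```

With raw counts as heights, the widest class (10-24) takes up roughly two thirds of the total area although it holds only 12 of the 42 observations. Dividing by class width restores the match between area and share of the data.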