Interpretation of the given box and whisker plot


I wish to understand whether I have interpreted the box and whisker plot below correctly; this will also confirm my understanding of the topic. (I am learning basic statistics and measures of dispersion.)

Box & Whisker Plot:

enter image description here

Let's say the number line represents the ages of students; then my interpretation is as follows.

  • Students age group is 2-9
  • There are more students with age 6-7 & 7-8.5
  • The average student age is 7
  • Since each group (Least-Q1, Q1-Q2, Q2-Q3 & Q3-Greatest) in a box and whisker plot holds roughly an equal share of the data, the smallest-looking group would be denser or less variable. So does that mean that in the above example the (Q3-Greatest) group, containing students aged 8.5-9, is the densest of all and the least variable?

Is my understanding above correct? Also, what other interpretations can I make?


There are 2 best solutions below

On BEST ANSWER

Students age group is 2-9

Yes. 2 is the minimum age observed in the sample and 9 is the maximum age.

There are more students with age 6-7 & 7-8.5

Not exactly. Half of the children in the sample have ages represented within the 'box'; that is between 6 and 8.5. Roughly speaking, a quarter of the students are under 6 yrs old, a quarter of them are from 6 to 7 yrs old, a quarter are between 7 and 8.5 years old, and a quarter are older than 8.5 years.

The average student age is 7

More precisely, the median age is 7. (No more than half of the students are younger than 7 and no more than half are older than 7.)

Since each group (Least-Q1, Q1-Q2, Q2-Q3 & Q3-Greatest) in a box and whisker plot holds roughly an equal share of the data, the smallest-looking group would be denser or less variable. So does that mean that in the above example the (Q3-Greatest) group, containing students aged 8.5-9, is the densest of all and the least variable?

I don't think it is useful to use a boxplot to talk about 'density' with any precision. Certainly, it is true that about 3/4 of the students are concentrated between 6 and 9 years of age (a span of 3-4 years, depending on how you view age), while only 1/4 are in the longer span from 2 to 6 years. But a histogram is a better graphical device for showing 'densities'.

Note: A boxplot gives no information about how many students are in the sample. It is best to use boxplots only for samples larger than a dozen or so. The mechanism of making a boxplot depends on finding three numbers which cut the sorted observations into four approximately equal parts. [They are the lower quartile $Q_1$ (left end of the box), the median (heavy line within the box), and the upper quartile $Q_3$ (right end of the box).] If you have a sample of only seven observations, it is difficult to know how to divide them into four approximately equal 'chunks'.
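As a minimal sketch of that mechanism, the R lines below take a made-up vector of 12 ages (hypothetical values, not the data behind the plot) and find the three cut points. `quantile()` and `fivenum()` are base R functions; they use different quartile rules, so their results can differ a little in small samples, which is one reason tiny samples make poor boxplots.

```r
# Hypothetical ages, invented for illustration (not the data behind the plot).
ages <- c(2, 3.5, 5, 6, 6.2, 6.8, 7.2, 7.5, 8, 8.5, 8.8, 9)

# Three cut points that split the sorted data into four parts.
cuts <- quantile(ages, probs = c(0.25, 0.50, 0.75))
print(cuts)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max.
print(fivenum(ages))

# Share of the observations that falls in each of the four parts.
parts <- cut(ages, breaks = c(min(ages), cuts, max(ages)),
             include.lowest = TRUE)
share <- table(parts) / length(ages)
print(share)
```

With 12 observations the four parts come out equal; with 7 observations they cannot, whichever quartile rule is used.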

Here is a histogram of a (fake) dataset of 40 ages that might have made your boxplot. A histogram is based on area: notice that each student is represented by one 'brick' of area within his or her bar of the histogram.

The tick marks beneath the histogram show 'exact' ages of the students (e.g., to the nearest number of weeks). At the resolution of this graph, tick marks for 2 or more students of very nearly the same age may appear as one mark.

enter image description here
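The fake dataset itself is not given above. As a rough sketch only, one way to build 40 hypothetical ages with about the same boxplot is to put 10 students in each of the four intervals read from the plot. The seed, the uniform spread within each interval, and the half-year histogram bins are all assumptions made for illustration.

```r
# Hypothetical reconstruction: 40 fake ages, 10 in each quarter of the plot.
# The end points 2 and 9 are pinned so the extremes match the plot.
set.seed(2024)  # arbitrary seed, chosen only for reproducibility
ages <- c(2, runif(9, 2, 6),    # youngest quarter: 2 to 6
          runif(10, 6, 7),      # second quarter: 6 to 7
          runif(10, 7, 8.5),    # third quarter: 7 to 8.5
          runif(9, 8.5, 9), 9)  # oldest quarter: 8.5 to 9

par(mfrow = c(2, 1))            # boxplot above, histogram below
# range = 0 extends the whiskers to the extremes instead of flagging outliers.
boxplot(ages, horizontal = TRUE, range = 0, col = "skyblue2")
hist(ages, breaks = seq(2, 9, by = 0.5), col = "skyblue2", main = "Fake ages")
rug(ages)                       # one tick mark per student
par(mfrow = c(1, 1))            # restore the single-plot layout
```

The quartiles of such a sample should land near 6, 7 and 8.5, though not exactly on them.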

Addendum: A comment expressed interest in means, medians, and modes of skewed distributions. Here are samples from two distributions. The first is $\mathsf{Gamma}(shape=2, rate=1/20).$ It is a right-skewed distribution with mode 20, median 33.57, and mean 40. A sample of size $n = 100$ has the following summary statistics:

summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.441  19.121  33.433  40.629  49.972 203.525 

The sample mean and median are similar to the population mean and median. There is no formal mode because no two observations are exactly the same, but one might say that the modal interval of the histogram (lower-left in the figure below) is $(20, 40].$

The second distribution is $\mathsf{Beta}(2, 1).$ It is a left-skewed distribution with mode 1, median 0.7071, and mean 2/3. A sample of size $n = 100$ has the following summary statistics:

summary(y)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.08611 0.49792 0.71515 0.67491 0.87883 0.99579 

Here again, the sample mean and median closely imitate the population mean and median, respectively. The modal histogram interval is $(0.9, 1.0].$
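The population values quoted for the two distributions can be checked directly in R. This is only a sketch: the medians come from the base R quantile functions `qgamma()` and `qbeta()`, and the means and modes from the standard closed forms for these families. The variable names are made up for illustration.

```r
# Population medians from the quantile functions.
gamma_median <- qgamma(0.5, shape = 2, rate = 1/20)
beta_median  <- qbeta(0.5, shape1 = 2, shape2 = 1)
print(c(gamma_median = gamma_median, beta_median = beta_median))

# Gamma(shape, rate): mean = shape/rate, mode = (shape - 1)/rate for shape >= 1.
shape <- 2; rate <- 1/20
gamma_mean <- shape / rate
gamma_mode <- (shape - 1) / rate

# Beta(a, b): mean = a/(a + b). For a = 2, b = 1 the density is 2y on (0, 1),
# which is increasing, so the mode sits at the right end point 1.
a <- 2; b <- 1
beta_mean <- a / (a + b)
beta_mode <- 1

print(c(gamma_mean = gamma_mean, gamma_mode = gamma_mode,
        beta_mean = beta_mean, beta_mode = beta_mode))
```

In both cases the mean is pulled further toward the long tail than the median, which is the usual pattern for skewed distributions.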

The figure below shows the gamma distribution at left and the beta distribution at right. The tick marks below the histogram show the locations of individual points. The curves are the density functions of the respective distributions.

enter image description here

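set.seed(1234)                                # make the simulated samples reproducible
par(mfcol=c(2,2))                             # 2 x 2 grid of plots, filled column by column
 x = rgamma(100, 2, .05)                      # gamma sample: shape 2, rate 1/20
 boxplot(x, horizontal=TRUE, col="skyblue2")
 hist(x, prob=TRUE, col="skyblue2"); rug(x)   # density-scale histogram with tick marks
   curve(dgamma(x, 2, .05), add=TRUE, lwd=2)  # overlay the gamma density

 y = rbeta(100, 2, 1)                         # beta sample: shape parameters 2 and 1
 boxplot(y, horizontal=TRUE, col="skyblue2")
 hist(y, prob=TRUE, col="skyblue2"); rug(y)
   curve(dbeta(x, 2, 1), add=TRUE, lwd=2)     # curve() expects its variable to be named x
par(mfrow=c(1,1))                             # restore the single-plot layout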

Note to @linuxuser: If your textbook does not discuss gamma and beta distributions, you can read about them in Wikipedia. Both families of distributions are widely used in applied probability modeling. [Roughly speaking, the gamma function $\Gamma(\cdot),$ used to define the density functions, is a continuous version of the factorial function, filling in values for non-integers. For positive integer $k$, we have $\Gamma(k) = (k-1)!;$ for example $\Gamma(5) = 4! = 24.$]
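A quick way to see the link between the gamma function and factorials is to tabulate both in R. This is a small illustration using the base R functions `gamma()` and `factorial()`; the column names are made up.

```r
# Compare gamma(k) with (k - 1)! for the first few positive integers.
k <- 1:6
tab <- cbind(k = k,
             gamma_k = gamma(k),
             factorial_k_minus_1 = factorial(k - 1))
print(tab)

# A non-integer argument gives a value between the neighbouring factorials:
# gamma(2) = 1! = 1 and gamma(3) = 2! = 2.
g <- gamma(2.5)
print(g)
```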

On
  • Students age group is 2-9

  • There are more students with age 6-7 & 7-8.5

  • The average student age is 7

  • Since each group (Least-Q1, Q1-Q2, Q2-Q3 & Q3-Greatest) in a box and whisker plot holds roughly an equal share of the data, the smallest-looking group would be denser or less variable. So does that mean that in the above example the (Q3-Greatest) group, containing students aged 8.5-9, is the densest of all and the least variable?

Point $1$ is correct.

Note that point $2$ contradicts point $4$: each interval holds roughly $25\%$ of the data, so $Q_1$-$Q_3$ holds roughly $50\%$ of the data. Also, the statement is incomplete: "more students with age 6-7 & 7-8.5" than which group? Do you mean more students than in some other specific interval, or more in general?

In point $3$, the word "average" is ambiguous, as there are three types of average: mean, median and mode. Here $Q_2$ is the median. Depending on the shape of the distribution (positively skewed, negatively skewed, or symmetric), the mean, median and mode are usually ordered as mode $<$ median $<$ mean, mean $<$ median $<$ mode, or mean $\approx$ median $\approx$ mode, respectively (although the last does not always hold for symmetric distributions). The data look negatively skewed, because $75\%$ of the data lie in the interval $6$-$9$ against $25\%$ in $2$-$6$, which implies that the data (the ages, or in effect the number of students) are more densely situated in the interval $6$-$9$. Consequently, you can say the data are less variable (more closely situated) in the interval $6$-$9$ than in the interval $2$-$6$.
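The comparison of intervals can be put into numbers with a short R sketch. It uses the five values read from the plot in the answers above (2, 6, 7, 8.5, 9) and assumes each interval holds exactly $25\%$ of the students, which a real sample only satisfies approximately. The result is an average density per interval and says nothing about how ages are spread inside an interval.

```r
# Five-number summary as read from the plot in the answers above.
five <- c(min = 2, Q1 = 6, median = 7, Q3 = 8.5, max = 9)

width <- diff(five)            # length of each of the four intervals, in years
share <- rep(0.25, 4)          # assumption: each interval holds exactly 25%
avg_density <- share / width   # share of students per year of age

print(data.frame(interval    = c("2-6", "6-7", "7-8.5", "8.5-9"),
                 width       = as.numeric(width),
                 avg_density = as.numeric(avg_density)))
```

Under that assumption the shortest interval, 8.5-9, has the highest average density and the longest, 2-6, the lowest, which is what point 4 of the question was getting at.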