Boxplots and Quartiles


I'm studying Boxplots for statistics.

What I haven't understood is this: since a boxplot represents four quartiles, each containing 25% of the dataset, why is it possible to have data points that lie outside the boxplot?


There are 2 best solutions below


Here is a sample of 100 observations from the distribution $\mathsf{Norm}(\mu=300,\,\sigma=8),$ rounded to integers and sorted in increasing order. (Numbers in brackets show the index of the first observation in each row.)

x

 [1] 255 259 261 263 266 271 275 275 276 276 277 278 279 279 280 282 283 284 285 286 287 287 288 288 289
[26] 291 292 292 293 293 294 294 295 295 295 295 295 295 297 297 298 298 299 299 299 300 300 302 302 303
[51] 303 303 304 304 304 304 305 305 307 307 307 307 308 308 308 309 309 309 310 311 311 311 312 312 312
[76] 314 314 314 315 317 318 318 320 320 321 322 324 324 324 325 327 327 328 330 330 331 336 339 353 354

I think you may be confusing the box of the boxplot with the entire boxplot (box, whiskers, outliers). Roughly speaking, the box contains only half of the data, while all of the data are represented by the entire boxplot.

Below is a boxplot of these observations. Observations in the first row (1/4 of the observations) lie to the left of the box. Those in the second row lie within the box below the median (solid bar); those in the third row lie within the box at or above the median. Those in the last row lie to the right of the box. (Small tick marks show locations of individual observations. There are only 56 distinct tick marks because tied values are plotted on top of one another.)

In particular, notice that only half of the observations lie within the box. The box contains observations from 291 through 312.

Of the quarter of the observations to the left of the box, 255 is shown as an outlier. The interquartile range is $IQR = 22$ and the lower quartile is $Q_1 = 290.5.$ The observation at 255 is flagged as an outlier because it is smaller than $Q_1 - 1.5(IQR) = 257.5.$ Similarly, the two largest observations, $353$ and $354,$ are outliers because they lie above $Q_3 + 1.5(IQR).$ (It is not unusual for a normal sample of size $n=100$ to have several outliers according to this boxplot "1.5 IQR" rule.)
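Spelling out the arithmetic for both cutoffs may help. This uses $Q_1 = 290.5$ and $Q_3 = 312.5$ as reported by R's `summary`; other quartile definitions would shift the cutoffs slightly.

$$Q_1 - 1.5(IQR) = 290.5 - 1.5(22) = 290.5 - 33 = 257.5, \qquad Q_3 + 1.5(IQR) = 312.5 + 33 = 345.5.$$

So 255 falls below the lower cutoff, 353 and 354 fall above the upper cutoff, and every other observation (259 through 339) lies between them.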

[Figure: horizontal boxplot of the 100 observations, with tick marks showing the individual values.]

Below is a summary of the data from R statistical software, along with some other relevant computations. (Different textbooks and software have slightly different definitions of the quartiles, but those differences are not important for our purposes.)

summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  255.0   290.5   303.0   301.8   312.5   354.0 
boxplot.stats(x)$out
## 255 353 354
IQR(x)
## 22
length(unique(x))
## 56
outliers = replicate(10^6,  length(boxplot.stats(rnorm(100, 300, 8))$out) )
mean(outliers > 0)
## 0.521605   # slightly over half of normal samples of size 100 have at least one outlier
mean(outliers)
## 0.923327   # a million samples average almost 1 outlier per sample
mean(outliers >=2)
## 0.23338    # over 20% of the samples have at least 2 outliers 

Data points may well be beyond the whiskers of the plot.

The box itself is drawn from the value of the lower quartile to the upper quartile.

The median is marked within the box.

The extent of the whiskers, however, is normally given by the least and greatest data values that lie within 1.5 times the interquartile range of the lower and upper quartiles, respectively. The whiskers do not normally extend to the least and greatest values of the data, which is probably what you are expecting. Any points beyond the whiskers are plotted individually.
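As a rough check of this rule on the 100 observations from the first answer, here is a minimal R sketch. It assumes R's default quartiles from `quantile` (the same ones `summary` reports); the variable names are just illustrative. R's own `boxplot` uses hinges from `fivenum`, which can differ slightly from these quartiles, although for this sample the same three points end up outside the whiskers.

```r
# The 100 observations listed in the first answer.
x <- c(255, 259, 261, 263, 266, 271, 275, 275, 276, 276, 277, 278, 279, 279,
       280, 282, 283, 284, 285, 286, 287, 287, 288, 288, 289,
       291, 292, 292, 293, 293, 294, 294, 295, 295, 295, 295, 295, 295, 297,
       297, 298, 298, 299, 299, 299, 300, 300, 302, 302, 303,
       303, 303, 304, 304, 304, 304, 305, 305, 307, 307, 307, 307, 308, 308,
       308, 309, 309, 309, 310, 311, 311, 311, 312, 312, 312,
       314, 314, 314, 315, 317, 318, 318, 320, 320, 321, 322, 324, 324, 324,
       325, 327, 327, 328, 330, 330, 331, 336, 339, 353, 354)

# Lower and upper quartiles (R's default definition) and the IQR.
q   <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Cutoffs 1.5 IQRs beyond the quartiles.
lower_cutoff <- q[1] - 1.5 * iqr
upper_cutoff <- q[2] + 1.5 * iqr

# Whiskers end at the most extreme data values still inside the cutoffs.
whiskers <- range(x[x >= lower_cutoff & x <= upper_cutoff])

# Points beyond the whiskers are the ones drawn individually.
beyond <- x[x < lower_cutoff | x > upper_cutoff]

print(whiskers)   # 259 339
print(beyond)     # 255 353 354
```

Note that the whiskers stop at actual data values (259 and 339), not at the cutoffs 257.5 and 345.5 themselves.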