I'm studying boxplots for statistics.
What I haven't understood is this: since a boxplot represents four quartiles, each containing 25% of the dataset, why is it possible to have data points that lie outside the boxplot?
Data points may well be beyond the whiskers of the plot.
The box itself is drawn from the value of the lower quartile to the upper quartile.
The median is marked within the box.
The extent of the whiskers, however, is normally given by the least and greatest data values that lie within 1.5 times the interquartile range of the lower and upper quartiles, respectively. Any points beyond the whiskers are drawn individually as outliers. The whiskers do not normally extend to the least and greatest values of the data, which is probably what you are expecting.
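As a rough illustration of the whisker rule described above, here is a minimal Python sketch. The function name `boxplot_summary`, the ten-value `sample`, and the choice of the "inclusive" quartile method are all assumptions made for this example; other software may compute the quartiles slightly differently.

```python
import statistics

def boxplot_summary(data, k=1.5):
    """Return the quartiles, whisker ends, and outliers under the k * IQR rule."""
    # Quartile conventions vary; "inclusive" is just one common choice.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers stop at the most extreme data values still inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = sorted(x for x in data if x < lower_fence or x > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }

# Hypothetical data: nine ordinary values and one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = boxplot_summary(sample)
print(summary)
```

With this made-up sample the upper whisker stops at 9, and the value 30 is reported separately as an outlier rather than being covered by the whisker.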
I think you may be confusing the box of the boxplot with the entire boxplot (box, whiskers, outliers). Roughly speaking, the box contains only half of the data, while all of the data are represented by the entire boxplot.
Here is a sample of 100 observations from the distribution $\mathsf{Norm}(\mu=300,\,\sigma=8),$ rounded to integers. (Numbers in brackets show the index of the first observation in each row.)
Below is a boxplot of these observations. Observations in the first row (1/4 of the observations) lie below the left end of the box. Those in the second row lie within the box below the median (solid bar); those in the third row lie within the box and above the median. Those in the last row lie to the right of the box. (Small tick marks show locations of individual observations. Only 56 ticks are visible because tied values are plotted on top of one another.)
In particular, notice that only half of the observations lie within the box. The box contains observations from 291 through 312.
Of the quarter of the observations to the left of the box, 255 is shown as an outlier. The interquartile range is $IQR = 22$ and the lower quartile is $Q_1 = 290.5.$ The observation at 255 is marked as an outlier because it is smaller than $Q_1 - 1.5(IQR) = 257.5.$ Similarly, the two largest observations, $353$ and $354,$ are outliers. They lie above $Q_3 + 1.5(IQR) = 312.5 + 33 = 345.5.$ (It is not unusual for a normal sample of size $n=100$ to have several outliers according to this boxplot "1.5 IQR" rule.)
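The fence arithmetic above can be checked with a few lines of Python, and the remark about normal samples can be explored by simulation. This is only a sketch. The helper names `has_outlier` and `outlier_rate`, the seed, the number of trials, and the "inclusive" quartile method are assumptions for illustration. The simulation uses standard normal samples because the 1.5 IQR rule does not depend on location or scale.

```python
import random
import statistics

# Values stated in the answer; Q3 follows from Q3 = Q1 + IQR.
q1 = 290.5
iqr = 22
q3 = q1 + iqr
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
# The three observations the answer identifies as outliers.
flagged = [x for x in (255, 353, 354) if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, flagged)

def has_outlier(sample, k=1.5):
    """True if any value lies beyond the k * IQR fences of the sample."""
    lo_q, _, hi_q = statistics.quantiles(sample, n=4, method="inclusive")
    spread = hi_q - lo_q
    return any(x < lo_q - k * spread or x > hi_q + k * spread for x in sample)

def outlier_rate(n=100, trials=2000, seed=1):
    """Estimate the share of normal samples of size n with at least one outlier."""
    rng = random.Random(seed)
    hits = sum(
        has_outlier([rng.gauss(0, 1) for _ in range(n)]) for _ in range(trials)
    )
    return hits / trials

print(outlier_rate())
```

The first part reproduces the fences 257.5 and 345.5 and confirms that 255, 353, and 354 all fall outside them. The second part prints an estimated proportion, which will vary a little with the seed and the quartile convention.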
Below is a summary of the data from R statistical software, along with some other relevant computations. (Different textbooks and software have slightly different definitions of the quartiles, but those differences are not important for our purposes.)