I have a question about boxplots.
I'll first lay out what I understand and then state my question.
- $x=\{x_1,x_2,\dots,x_n\}$: the set of samples
- $q_1$, $q_3$: the first and third quartiles
- $w_l$, $w_u$: the lower and upper whiskers
- $IQR = q_3 - q_1$
- the box extends from $q_1$ to $q_3$
- $w_l = \max(\min(x),\, q_1 - 1.5\cdot IQR)$
- $w_u = \min(\max(x),\, q_3 + 1.5\cdot IQR)$
- $\text{outliers} = \{ x_i \in x \mid x_i < w_l \lor x_i > w_u\}$
Observations:
- The whiskers' distances from the box are not symmetric $\iff (w_l = \min(x) \lor w_u = \max(x))$
- $w_u - q_3 < q_1-w_l \implies \nexists\, x_i : x_i \in \text{outliers} \land x_i > w_u$
- $w_u - q_3 > q_1-w_l \implies \nexists\, x_i : x_i \in \text{outliers} \land x_i < w_l$
My question: if everything I have laid out is correct, how do you explain the presence of outliers in this speed of light boxplot (third experiment, lower outliers) and in this plot (see Wednesday, lower outliers)?
If my reasoning is wrong, please provide a simple numeric counterexample.
Consider the data $$\{0,4,5,5,5,6,6,6,6,7,20\}.$$ The median is $6$, the first quartile is $5$, and the third quartile is $6$. So the IQR is $1$, and the $1.5\cdot IQR$ rule puts the cutoffs at $5 - 1.5 = 3.5$ and $6 + 1.5 = 7.5$. It follows that $0$ is a lower outlier and $20$ is an upper outlier. What you need to take into account is that the box shows you where the middle 50% of the data lies, so if the box is particularly narrow, then the IQR is small, and any values outside the range determined by the $1.5\cdot IQR$ rule are outliers. There can be many outliers, or none at all.
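
If it helps to check the numbers, here is a minimal Python sketch of the computation. The function name `boxplot_summary` and its return keys are made up for this illustration. It assumes quartiles are taken as the medians of the lower and upper halves of the sorted data (other quartile conventions exist and can give slightly different values), and it assumes the common Tukey-style convention in which each whisker stops at the most extreme data point that still lies inside the $1.5\cdot IQR$ cutoffs, rather than at the cutoff itself.

```python
from statistics import median


def boxplot_summary(values, k=1.5):
    """Quartiles, cutoffs, whisker ends and outliers for a small sample.

    Assumptions (illustrative only): quartiles are the medians of the
    lower and upper halves of the sorted data, and whiskers end at the
    most extreme data points inside the k * IQR cutoffs. Needs enough
    data for both halves to be non-empty.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # For odd n the median itself is left out of both halves.
    upper = xs[half + 1:] if n % 2 else xs[half:]

    q1, q3 = median(lower), median(upper)
    iqr = q3 - q1
    low_cut, high_cut = q1 - k * iqr, q3 + k * iqr

    inside = [x for x in xs if low_cut <= x <= high_cut]
    outliers = [x for x in xs if x < low_cut or x > high_cut]

    return {
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "fences": (low_cut, high_cut),
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }


summary = boxplot_summary([0, 4, 5, 5, 5, 6, 6, 6, 6, 7, 20])
print(summary["fences"])    # (3.5, 7.5)
print(summary["whiskers"])  # (4, 7)
print(summary["outliers"])  # [0, 20]
```

Under these assumptions the whiskers end at $4$ and $7$, and there is one outlier on each side of the box.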