Boxplot: question about whiskers and outliers

I have a question about boxplots.
I'll lay out what I know first, and then my question.

  • $x=\{x_1, x_2, \ldots, x_n\}$: the set of samples
  • $q_1$,$q_3$: the first and third quartiles
  • $w_l$,$w_u$: the lower and upper whiskers
  • $IQR = q_3 - q_1$
  • box extends from $q_1$ to $q_3$
  • $w_l = \max(\min(x),\, q_1 - 1.5\cdot IQR)$
  • $w_u = \min(\max(x),\, q_3 + 1.5\cdot IQR)$
  • $\text{outliers} = \{ x_i \in x \mid x_i < w_l \vee x_i > w_u\}$

Observations:

  • $\text{the whiskers' distances from the box are not symmetric} \iff (w_l = \min(x) \vee w_u = \max(x))$
  • $w_u - q_3 < q_1 - w_l \implies \nexists\, x_i : x_i \in \text{outliers} \wedge x_i > w_u$
  • $w_u - q_3 > q_1 - w_l \implies \nexists\, x_i : x_i \in \text{outliers} \wedge x_i < w_l$

My question: if everything I stated above is correct, how do you explain the presence of outliers in this speed-of-light boxplot (third experiment, lower outliers) and in this plot (see Wednesday, lower outliers)?
If my reasoning is wrong, please provide a simple numeric counterexample.

There are 2 best solutions below

5
On BEST ANSWER

Consider the data $$\{0,4,5,5,5,6,6,6,6,7,20\}.$$ The median is $6$, the first quartile is $5$, and the third quartile is $6$. So the IQR is $1$, and the 1.5 IQR rule puts the limits at $5 - 1.5 = 3.5$ and $6 + 1.5 = 7.5$. It follows that $0$ is a lower outlier and $20$ is an upper outlier. What you need to take into account is that the box shows you where the middle 50% of the data lies, so if the box is particularly narrow, then the IQR is small, and any values outside the range determined by the 1.5 IQR rule are outliers. There can be many outliers, or none at all.
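
The arithmetic above can be checked with a short script. This is only a sketch: quartile conventions differ between textbooks and software, and it assumes the default ("exclusive") method of Python's `statistics.quantiles`, which happens to give the same quartiles as stated here for this data. The names `lower_fence` and `upper_fence` are introduced for illustration.

```python
from statistics import quantiles

data = [0, 4, 5, 5, 5, 6, 6, 6, 6, 7, 20]

# Quartiles (assumption: the default "exclusive" method of statistics.quantiles)
q1, median, q3 = quantiles(sorted(data), n=4)
iqr = q3 - q1

# Limits given by the 1.5 IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values strictly outside the limits are flagged as outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, median, q3, iqr)
print(lower_fence, upper_fence)
print(outliers)
```

With this data the script should flag one outlier on each side of the box.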

0
On

OK, I found the answer:

The definitions of $w_l$ and $w_u$ in my question were wrong. Referring to Wikipedia:

"whiskers can represent several possible alternative values" such as "the minimum and maximum of all of the data" or "the lowest datum still within 1.5 IQR of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile", or even "one standard deviation above and below the mean of the data" and finally "the 9th percentile and the 91st percentile" or "the 2nd percentile and the 98th percentile".
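
As a rough illustration of the second convention quoted above (whiskers at the lowest and highest datum still within 1.5 IQR of the quartiles), here is a minimal sketch. The function name `whiskers_within_fences` and the sample values are made up for this example, and the quartiles again assume the default method of Python's `statistics.quantiles`.

```python
from statistics import quantiles

def whiskers_within_fences(sample):
    """Return (q1, q3, w_l, w_u, outliers), with each whisker placed at the
    most extreme datum still within 1.5 IQR of the nearer quartile."""
    xs = sorted(sample)
    q1, _, q3 = quantiles(xs, n=4)  # assumption: default "exclusive" method
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Data inside the fences; the whiskers end at its extremes
    inside = [v for v in xs if lower_fence <= v <= upper_fence]
    w_l, w_u = inside[0], inside[-1]
    outliers = [v for v in xs if v < lower_fence or v > upper_fence]
    return q1, q3, w_l, w_u, outliers

# Hypothetical sample, chosen only to illustrate the point
sample = [0, 9, 10, 11, 12, 13, 14, 15, 16, 20, 24]
q1, q3, w_l, w_u, outliers = whiskers_within_fences(sample)

print("lower whisker length:", q1 - w_l)
print("upper whisker length:", w_u - q3)
print("outliers:", outliers)
```

In this sample the lower whisker is the shorter one, yet there is still a lower outlier. Under this convention a whisker ends at an actual data point rather than at $q_1 - 1.5\cdot IQR$ or $q_3 + 1.5\cdot IQR$, so the whisker lengths alone do not tell you on which side outliers can appear.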