Context: The figure below plots the record-high yearly precipitation in each state against the state's record-high 24-hour precipitation. Hawaii is a high outlier, with a record-high yearly precipitation of 704.83 inches of rain recorded in Kukui in 1982. (Figure will not embed because of new member rep.)
Question: Why is it that when an outlier is removed from the data in a scatter plot, the correlation decreases? Will this always be the case? Mathematically, I think this can be proven, but could anyone explain why this is the case from a more conceptual/intuitive point of view?
Thoughts: I think I kind of get it, but I'm having difficulty articulating why this is the case here. Would it be true that if all of the data were shifted up and the outlier moved to the bottom right of the figure, then there would be a slightly positive correlation without the outlier, but including the outlier would actually decrease the correlation?
The reason is that the rest of the data has low correlation. If you remove the outlier from this plot, you end up with almost all dots sitting around 100 on the y axis, no matter where they sit on the x axis. Hawaii, a single extreme point, makes the correlation appear stronger than it actually is.
So no, this will not always be the case. If you have strongly correlated data and remove an outlier that lies off the trend, your correlation will increase.
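If it helps to see both situations side by side, here is a minimal sketch with made-up data (not the actual precipitation records). The helper `pearson_r`, the seed, and every number below are assumptions chosen only for illustration: the first case is a flat cloud around 100 plus one far-away point, the second is a tight linear trend plus one point well off the line.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Case 1: a flat cloud around y = 100, plus one extreme point
# that is far away on both axes (synthetic stand-in for Hawaii).
x = rng.uniform(0, 10, 50)
y = 100 + rng.normal(0, 5, 50)
r_cloud = pearson_r(x, y)
r_cloud_out = pearson_r(np.append(x, 40.0), np.append(y, 700.0))

# Case 2: strongly correlated data, plus one point far off the trend.
x2 = rng.uniform(0, 10, 50)
y2 = 3 * x2 + rng.normal(0, 1, 50)
r_line = pearson_r(x2, y2)
r_line_out = pearson_r(np.append(x2, 10.0), np.append(y2, -60.0))

print("cloud, without outlier:", round(r_cloud, 3))
print("cloud, with outlier:   ", round(r_cloud_out, 3))
print("trend, without outlier:", round(r_line, 3))
print("trend, with outlier:   ", round(r_line_out, 3))
```

In the first case the single extreme point dominates the covariance, so dropping it lowers the correlation; in the second case the off-trend point drags the correlation down, so dropping it raises the correlation.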