How to deal with statistical outliers in RStudio

I have some outliers in a data set that I would like to deal with.

After some reading on the topic, I concluded that the trimmed mean and the boxplot function (in RStudio) might come in handy for dealing with the outliers.

I like to work with concrete baby examples, so let me give a concrete case and show what I have done so far. Perhaps you could point me in the right direction.

$a \leftarrow c(1,2,3,100,5,3,200,5,2,3) \implies \text{mean}(a) = 32.4.$

Obviously the two outliers are 100 and 200. Thus, if we want to find a more "realistic" mean value we simply remove 100 and 200 from the data set.

Since n = 10 in the data set a, and since there are two outliers (100, 200), the "correct" way to trim the outliers from a would be the following command:

$\text{mean}(a, \text{trim} = 0.2) = \text{mean}(c(3,5,3,5,2,3)) = 3.5$

In this baby example it was easy to find the "correct" trim value, which in this case was $0.2 = 20\%$.
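As a quick check on the baby example, here is a minimal sketch of what `trim = 0.2` does. R's `mean(x, trim = p)` drops the fraction `p` from each end of the sorted data, so the two smallest values (1 and 2) are removed along with 100 and 200. The variable names `k`, `kept`, `manual` and `no_outliers` are mine, introduced only for illustration.

```r
a <- c(1, 2, 3, 100, 5, 3, 200, 5, 2, 3)

# Built-in trimmed mean: trims 20% of the observations from EACH end.
trimmed <- mean(a, trim = 0.2)

# The same computation by hand: sort, drop k values at both ends, average.
k <- floor(length(a) * 0.2)                 # k = 2 values per end
kept <- sort(a)[(k + 1):(length(a) - k)]    # 2 3 3 3 5 5
manual <- mean(kept)

# For comparison: removing only the two large values 100 and 200.
no_outliers <- mean(a[!a %in% c(100, 200)])

print(c(trimmed = trimmed, manual = manual, no_outliers = no_outliers))
```

The trimmed mean (3.5) therefore differs from the mean of the data with only 100 and 200 removed (3), because trimming is symmetric.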

What I am looking for is a more general method for finding, given a large data set, the correct trim value, so that I can remove the outliers that are artificially inflating the mean.

A boxplot seems to be a good way to visualize the outliers, but for a very large data set it wouldn't give me any concrete numbers regarding how many outliers there might be.
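One possible way to get concrete numbers out of the boxplot rule is `boxplot.stats()`, the function that `boxplot()` uses to decide which points to draw individually. The sketch below applies it to the baby example with the default `coef = 1.5`; the names `flagged`, `n_flagged` and `share` are illustrative only, and whether the flagged points should be treated as outliers is a separate question.

```r
a <- c(1, 2, 3, 100, 5, 3, 200, 5, 2, 3)

# boxplot.stats() returns the hinges/whiskers and the points beyond the whiskers.
s <- boxplot.stats(a)        # default coef = 1.5, the same rule boxplot() uses

flagged <- s$out             # values lying beyond the whiskers
n_flagged <- length(flagged) # how many points the boxplot would draw separately
share <- n_flagged / length(a)

print(flagged)
print(c(count = n_flagged, share = share))
```

The same call works on a vector of any length, so it scales to a large data set where reading points off a plot is impractical.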

While writing this question I just came up with an idea. Perhaps I could:

  1. Loop through the columns in the row that I want to analyze for outliers.
  2. If the number in a column is greater than the real mean, count it.
  3. Now I have a variable which contains the number of columns in the row that exceed the real mean.
  4. If, say, 10 columns in the row exceed the real mean and the row has 100 columns in total, then to remove the 10 outliers from the 100 columns the trim value should be $\frac{10}{100} = 0.1$.
  5. So the command dealing with the 10 outliers would be $\text{mean}(\text{vector}, \text{trim} = 0.1)$.
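The steps above can be sketched without an explicit loop, assuming the row has already been extracted as a numeric vector and that "the real mean" means the ordinary untrimmed mean. The helper name `share_above_mean` is hypothetical, and this is only a transcription of the idea, not a recommended rule.

```r
# Hypothetical helper: fraction of values above the ordinary (untrimmed) mean.
share_above_mean <- function(v) {
  sum(v > mean(v)) / length(v)
}

a <- c(1, 2, 3, 100, 5, 3, 200, 5, 2, 3)

trim_value <- share_above_mean(a)      # 2 of the 10 values exceed mean(a) = 32.4
result <- mean(a, trim = trim_value)   # trimmed mean using that fraction

print(c(trim_value = trim_value, result = result))
```

On the baby example this reproduces `trim = 0.2`. Note, though, that for roughly symmetric data about half of the values lie above the mean, so this count would flag far more points than just the outliers.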

Any comments or help is appreciated.

There is 1 best solution below

If you have samples of size $n = 100$ from a normal distribution, you will typically see boxplot outliers in half or more of the samples. Here are boxplots of 20 samples, each of size 100.

# 20 samples of size 100 from a normal distribution (mean 100, sd 15)
x = rnorm(2000, 100, 15)
g = rep(1:20, each=100)   # sample labels 1, ..., 20
boxplot(x ~ g, col="skyblue2", pch=20)

More than half have at least one outlier.

[Figure: boxplots of the 20 normal samples, each of size 100]
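The count behind that statement can be checked numerically rather than read off the plot. The sketch below is one way to do it, using `boxplot.stats()` per sample; the seed is an arbitrary choice of mine for reproducibility, and the exact count will vary from run to run.

```r
set.seed(1)   # arbitrary seed, only so that a rerun gives the same samples

x <- rnorm(2000, 100, 15)
g <- rep(1:20, each = 100)

# Number of points beyond the whiskers in each of the 20 samples.
n_out <- tapply(x, g, function(s) length(boxplot.stats(s)$out))

# How many of the 20 samples show at least one boxplot outlier.
samples_with_outliers <- sum(n_out > 0)

print(n_out)
print(samples_with_outliers)
```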

For exponential data, almost all samples of size 100 have outliers; outliers are a typical characteristic of this distribution.

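# 20 samples of size 100 from an exponential distribution (rate 0.01, mean 100)
x = rexp(2000, .01)
g = rep(1:20, each=100)   # sample labels 1, ..., 20
boxplot(x ~ g, col="skyblue2", pch=20)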

[Figure: boxplots of the 20 exponential samples, each of size 100]

You really want to delete all of those outliers?
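To see what deleting them would cost, here is a small sketch comparing the ordinary mean with a trimmed mean on exponential data whose true mean is known to be 1/rate = 100. The sample size of 100000, the seed, and `trim = 0.1` are my own choices for illustration, not values from the discussion above.

```r
set.seed(1)   # arbitrary seed for reproducibility

x <- rexp(100000, rate = 0.01)      # true mean is 1 / 0.01 = 100

plain_mean <- mean(x)               # estimates the true mean
trimmed_mean <- mean(x, trim = 0.1) # drops 10% of the values from each end

print(c(plain = plain_mean, trimmed = trimmed_mean))
```

The trimmed mean comes out noticeably below 100, because the large values it discards are a genuine part of this right-skewed distribution rather than errors.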