I have some outliers in a data set that I would like to deal with.
After some reading on the topic, I concluded that the trimmed mean and the boxplot function (in R, via RStudio) might come in handy for dealing with the outliers.
I like to work with concrete baby examples, so let me give a concrete case and show you what I have done so far. Perhaps you could guide me in the right direction.
$a$ <- $c(1,2,3,100,5,3,200,5,2,3) \implies mean(a) = 32.4.$
Obviously the two outliers are 100 and 200. Thus, if we want to find a more "realistic" mean value we simply remove 100 and 200 from the data set.
Since n = 10 in the data set a, and since there are two outliers (100, 200), the "correct" way to trim the outliers from a would be the following command:
$mean(a, trim = 0.2) = mean(c(3,5,3,5,2,3)) = 3.5$
In this baby example it was easy to find the "correct" trim value, which in this case was $0.2 = 20\%$.
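For reference, here is a minimal R sketch of what `trim` does on this vector. Per the documentation of `mean()`, `trim` is the fraction of observations dropped from each end of the sorted data, so with n = 10 and `trim = 0.2` the two smallest values are dropped along with the two largest. The names `k` and `kept` are only illustrative.

```r
a <- c(1, 2, 3, 100, 5, 3, 200, 5, 2, 3)

# trim is the fraction removed from EACH end of the sorted vector
k <- floor(length(a) * 0.2)                  # observations dropped per end
kept <- sort(a)[(k + 1):(length(a) - k)]     # values that survive trimming

kept                    # 2 3 3 3 5 5
mean(kept)              # 3.5
mean(a, trim = 0.2)     # 3.5, same as mean(kept)
mean(a[a < 100])        # 3, the mean after dropping only 100 and 200
```

So the 3.5 above comes from six values, not eight: removing only 100 and 200 would give 3.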
What I am looking for is a more general method for finding, given a large data set, the trim value that removes the outliers that are artificially inflating the mean.
A boxplot seems to be a good way to visualize the outliers, but for a very large data set it wouldn't give me any concrete numbers on how many outliers there might be.
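If concrete numbers are what is missing, one option is `boxplot.stats()`, which returns the points a boxplot would draw as outliers under the default 1.5 × IQR rule. Below is a minimal sketch on the example vector. The variable name `out` is illustrative, and whether that rule suits a given data set is a separate question.

```r
a <- c(1, 2, 3, 100, 5, 3, 200, 5, 2, 3)

out <- boxplot.stats(a)$out   # points beyond the whiskers (1.5 * IQR rule)
out                           # 100 200
length(out)                   # 2 points flagged
length(out) / length(a)       # 0.2, the flagged fraction of the data
```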
While writing this question I just came up with an idea. Perhaps I could:
- Loop through the columns in the row that I want to analyze for outliers.
- If the value in a column is greater than the real mean, count it.
- Now I have a variable that contains the number of columns in the row with a value greater than the real mean.
- If, say, that count is 10 and the total number of columns in the row is 100, then to remove the 10 outliers from the 100 columns the trim value should be $\frac{10}{100} = 0.1$.
- So the command dealing with the 10 outliers would be $mean(vector, trim = 0.1)$.
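The steps above could be sketched in R roughly as follows. The helper `trim_from_mean` and the matrix `m` are hypothetical names introduced for illustration, and the loop is replaced by a vectorized count.

```r
# Hypothetical helper: fraction of values above the untrimmed mean
trim_from_mean <- function(x) {
  sum(x > mean(x)) / length(x)
}

a <- c(1, 2, 3, 100, 5, 3, 200, 5, 2, 3)
tr <- trim_from_mean(a)   # 2 of 10 values exceed mean(a) = 32.4, so 0.2
mean(a, trim = tr)        # 3.5

# For a matrix m with one data set per row (m is an assumed layout):
m <- rbind(a, rev(a))
row_means <- apply(m, 1, function(row) mean(row, trim = trim_from_mean(row)))
```

Note that this counts values above the mean rather than outliers as such. On roughly symmetric data with no outliers, about half the values exceed the mean, and `mean()` with `trim` of 0.5 or more returns the median.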
Any comments or help is appreciated.
If you take samples of size $n = 100$ from a normal distribution, you will typically see outliers in half or more of the samples. Here are boxplots of 20 samples, each of size 100.
More than half have at least one outlier.
For exponential data, almost all samples of size 100 have outliers; outliers are a typical characteristic of this distribution.
You really want to delete all of those outliers?
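These claims can be checked with a quick simulation. The sketch below assumes standard normal and rate-1 exponential data, uses the default boxplot rule via `boxplot.stats()`, and picks the seed and number of replications arbitrarily. `has_outlier` is a hypothetical helper name.

```r
set.seed(1)   # arbitrary seed, only for reproducibility

# TRUE if a boxplot of x would flag at least one point
has_outlier <- function(x) length(boxplot.stats(x)$out) > 0

reps <- 2000  # number of simulated samples (an arbitrary choice)
n <- 100      # sample size used in the discussion above

p_norm <- mean(replicate(reps, has_outlier(rnorm(n))))
p_exp  <- mean(replicate(reps, has_outlier(rexp(n))))

p_norm   # share of normal samples with at least one boxplot outlier
p_exp    # share of exponential samples with at least one boxplot outlier

# Side-by-side boxplots of 20 normal samples of size 100 (one per column)
boxplot(matrix(rnorm(20 * n), nrow = n), main = "20 normal samples, n = 100")
```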