What do you call it when you remove the top n% and the bottom n% of a dataset?


I am currently writing about a dataset of collected handwriting samples. I want to show some characteristics of the dataset. For example, I think it is interesting to show how long it took users to create the dataset.

So I extract the time for each recording and thus get a list of non-negative real numbers. A few instances have values > 30,000 and some have values < 5, but most instances are in [30, 60], so I want to cut off those outliers and visualize only the rest in a plot.

So I remove the top 0.5% and the bottom 0.5% before visualizing it (where the x-axis is the time $t$ and the y-axis is the number of recordings with recording time lower than $t$).
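A minimal sketch of this count-based cut, in plain Python. The function name `trim_by_count` and the sample times are hypothetical and only serve as an illustration; rounding the number of dropped points down is an assumption, not something stated above.

```python
def trim_by_count(values, fraction=0.005):
    """Drop the smallest and largest `fraction` of the points, by count."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    # Number of points to drop at each end (assumption: round down).
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k]


# Hypothetical recording times with one very small and one very large value.
times = [2.0, 31.0, 35.0, 40.0, 42.0, 47.0, 51.0, 55.0, 58.0, 30000.0]

# With only 10 points, a 10% cut per side removes exactly one point per end.
trimmed = trim_by_count(times, fraction=0.1)
```

Because the cut is defined by rank rather than by value, a single extreme value such as 30,000 does not influence where the cut falls.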

Is there a name for removing the lowest 0.5% and the highest 0.5% of all data points? (Here 0.5% refers to the total number of data points, not to the range of values.)


Best answer

When I have done this in the past, this was called "trimming" (not my term).

I used this to make graphics more readable, and I typically trimmed the top and bottom 5% of the value range, not 5% of the number of points.

More specifically, if the values ranged from 0 to 100, I removed all the points with values that were > 95 or < 5, and then rescaled the display so the remaining points were displayed from the min to the max (usually 0 to 255).
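The value-based trimming and rescaling described above could be sketched as follows. This is only an illustration under stated assumptions: the function name `trim_by_value_and_rescale`, its parameters, and the sample data are hypothetical, and the thresholds are taken relative to the observed minimum and maximum.

```python
def trim_by_value_and_rescale(values, fraction=0.05, out_min=0.0, out_max=255.0):
    """Drop points in the top and bottom `fraction` of the value range,
    then stretch the remaining points linearly onto [out_min, out_max]."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    span = hi - lo
    lower = lo + fraction * span  # e.g. 5 for data ranging from 0 to 100
    upper = hi - fraction * span  # e.g. 95 for data ranging from 0 to 100
    kept = [v for v in values if lower <= v <= upper]
    if not kept:
        return []
    k_lo, k_hi = min(kept), max(kept)
    if k_hi == k_lo:
        # All remaining points are equal, so there is nothing to stretch.
        return [out_min for _ in kept]
    return [out_min + (v - k_lo) / (k_hi - k_lo) * (out_max - out_min) for v in kept]


# Hypothetical data ranging from 0 to 100: values below 5 or above 95 are dropped.
data = [0, 3, 10, 50, 90, 97, 100]
display = trim_by_value_and_rescale(data)
```

Unlike a count-based cut, the thresholds here depend on the extremes of the data, so a single very large outlier shifts both cut-offs.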

I found that this made details in the data much easier to see.

Another method is to generate a histogram of the values and adjust the displayed values such that the modified display has a uniform histogram. This is usually called histogram equalization; it essentially maps each value through the cumulative distribution of the data.
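A rough sketch of this idea, commonly known as histogram equalization, replaces each value by its position in the empirical cumulative distribution. The function name `equalize` and the 0 to 255 output range are assumptions made for illustration.

```python
from bisect import bisect_right


def equalize(values, out_max=255.0):
    """Map each value to out_max times the fraction of points <= that value."""
    ordered = sorted(values)
    n = len(ordered)
    # bisect_right counts how many points are less than or equal to v,
    # which is the empirical cumulative distribution scaled by n.
    return [out_max * bisect_right(ordered, v) / n for v in values]


# Hypothetical data with one extreme value; the output is evenly spread.
equalized = equalize([1, 2, 3, 1000])
```

No points are removed with this approach. Outliers are simply compressed, since only the rank of a value matters and not its magnitude.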