Sorry, I don't know the math words.
I have a data set that looks like the following. I am counting things, and the graph shows how many of each thing there are. For example, I have a LOT of thing #0, and not many of thing #300.
In my application, things that are ubiquitous are assumed to exist; I don't need to include them in my reports - they are just noise. As a real-world example, if you listed all the things in the room in which you sit, most people would list tables, chairs, computers, etc. Only the most contrarian person would list bacteria and air.
My complete graph has a much longer tail than you see here - 10x more data. The average for my data set is very low, something like 30. The standard deviation is about 300.
Is it mathematically sensible for me to choose a number of standard deviations and remove the data that is "to the left" of the cut-off (i.e., the very high counts)? In my case, say I choose 4 standard deviations: 30 + 4*300 = 1230. So I'd cut off everything with a count greater than 1230.
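To make the question concrete, here is a minimal sketch of the cut-off I'm describing. The counts here are made-up example data, and I'm using the population standard deviation; the point is just the mean + k*stdev filter:

```python
import statistics

# Hypothetical counts: a few ubiquitous "things" plus a long tail of rare ones.
counts = [5000, 4800, 2000] + [10] * 50 + [1] * 200

mean = statistics.mean(counts)
stdev = statistics.pstdev(counts)  # population std dev of the counts

k = 4  # the number of standard deviations I chose for the cut-off
cutoff = mean + k * stdev

# Drop the ubiquitous things: keep only counts at or below the cut-off.
kept = [c for c in counts if c <= cutoff]
```

With my real numbers (mean 30, stdev 300, k = 4) the cut-off works out to 1230, as above.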
Am I using this right?
EDIT
I am continuously modifying this data set, so I want to recalculate the average and std dev values with some frequency.
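Since the data changes continuously, one option I'm considering (not from the original data pipeline, just a standard technique) is Welford's online algorithm, which updates the mean and standard deviation one value at a time instead of rescanning the whole data set:

```python
class RunningStats:
    """Welford's online algorithm: incrementally track mean and variance
    so the cut-off can be recomputed cheaply as new counts arrive."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stdev(self):
        # Population standard deviation of the values seen so far.
        return (self.m2 / self.n) ** 0.5 if self.n > 0 else 0.0


rs = RunningStats()
for count in [5, 10, 15]:
    rs.add(count)

cutoff = rs.mean + 4 * rs.stdev()
```

This only handles additions; if I also delete or modify counts, periodic full recomputation may still be simpler.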
