Finding out what data is useful to use...

I'm trying to determine the best method for narrowing down data to use in a program I'm writing. I could use some help.

Here is a use case: sometimes we have bad data in our system, and decisions end up being made from that bad data. For instance,

I might have 2-5 variables.

Example:

Data Set 1: 2.50, 2.35, 0.25

Data Set 2: 3.55, 2.25, 2.10

Option 1: I can use the high value, the middle value, or the low value. In data set 1 the low value is obviously a mistake, and in data set 2 the high value should be ignored.

Option 2: If I use the average for data set 1, it comes out to 1.7, which is still too low.

I know there has to be a good way to write a formula that identifies which values are outliers and excludes them.

Thoughts?

There is 1 best solution below

This would be a better question for Cross Validated. It is a hard problem, and statisticians have worked on it a lot.

You need to have a model of what good data looks like. For example, if you measure the length of a rod with an English tape measure, the results should cluster within $\pm \frac{1}{8}$ inch, maybe better than that. You can take the median and reject any reading that falls outside that window around it. Alternatively, maybe you think the data is normally distributed, though real-world data often has longer tails. In that case you can use the data to estimate the standard deviation, then reject points that are too many standard deviations away from the mean. I have an old text on Exploratory Data Analysis that discusses these questions in some detail.
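As a rough illustration of the two ideas above, here is a minimal Python sketch applied to the data sets from the question. The function names, the window half-width of 0.5, and the sigma thresholds are all assumptions chosen for the example, not values the question supplies; the right window has to come from your own model of what good data looks like.

```python
from statistics import mean, median, stdev


def reject_outside_window(values, half_width):
    """Keep only values within half_width of the median (hypothetical helper)."""
    center = median(values)
    return [v for v in values if abs(v - center) <= half_width]


def reject_by_sigma(values, max_sigmas):
    """Keep only values within max_sigmas sample standard deviations of the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, needs at least 2 values
    if sigma == 0:
        return list(values)  # all values identical, nothing to reject
    return [v for v in values if abs(v - mu) <= max_sigmas * sigma]


data_set_1 = [2.50, 2.35, 0.25]
data_set_2 = [3.55, 2.25, 2.10]

# Median window: half_width=0.5 is an assumed tolerance, not a recommendation.
print(reject_outside_window(data_set_1, 0.5))  # [2.5, 2.35]
print(reject_outside_window(data_set_2, 0.5))  # [2.25, 2.1]

# Standard deviation rule: with only three points a threshold of 2 rejects nothing.
print(reject_by_sigma(data_set_1, 2))  # [2.5, 2.35, 0.25]
print(reject_by_sigma(data_set_1, 1))  # [2.5, 2.35]
```

Note that with only three values, no point can lie more than about 1.15 sample standard deviations from the mean, so the usual "2 or 3 sigma" cutoffs never trigger. For data sets this small, the median-based window is likely the more practical of the two.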