Finding a faulty value in a dataset

30 Views Asked by At

I have prices from wholesalers for a product. But sometimes the wholesalers doing mistakes and list the product to a unrealistic price. I want to filter the unrealistic prices out. This is an example:

[0] => Array
    (
        [articlenumber] => 32158021n
        [wholesaler] => wholesaler_0
        [price] => 37.00
    )

[1] => Array
    (
        [articlenumber] => 32158021n
        [wholesaler] => wholesaler_1
        [price] => 204.33
    )

[2] => Array
    (
        [articlenumber] => 32158021n
        [wholesaler] => wholesaler_2
        [price] => 209.32
    )

[3] => Array
    (
        [articlenumber] => 32158021n
        [wholesaler] => wholesaler_3
        [price] => 211.12
    )

[4] => Array
    (
        [articlenumber] => 32158021n
        [wholesaler] => wholesaler_4
        [price] => 243.89
    )

As you can see wholesaler_0 is offering the product to an unrealistic price (37 compared to >200). Which is the best way to identify the faulty price?

What I have thought:
To check a price I would calculate the average of all other prices. Which would be for the first price:
37 compared to 217.17, but what now?
I could say everything that is 50% of the average price is invalid, so (217.17/2) = 108.59 > 37 and that would make the price invalid, as every price below 108.59$ would be invalid.
BUT
Lets say we have a 1 dollar price compared to 3 dollar average price, which would mean that the price is one third of the average price. But this is a realistic price, because there is only a difference of 2 dollar. It would mean that the percentage range of accepted prices should increase if the price gets smaller.

What is the correct way to calculate this? Could someone give me an advice?

Thanks in advance!

EDIT: Here is my solution: I have created a exponential formula which gives me the right percentage which a value can differ to be a faulty price. To build this I have used the help of a coordinate system:

80 * 0.96x +2.0

2

There are 2 best solutions below

1
On BEST ANSWER

There's no one "correct" way to detect outliers, and heuristics that incorporate your domain-specific knowledge could work better than a general-purpose approach. But I'd consider calculating a z-score for each entry based on the mean and standard deviation of the item prices.

0
On

(Caveat: I am not a professional statistician, I hope more people will chip in to answer this.)

There isn't one single correct way to calculate this. One common way is to try to:

  • Figure out what sort of distribution those prices have. Draw the bar chart and see if the prices resemble some known distribution. For example, the "bell" curve (which would imply Gaussian distribution). (NB there are methods to check if the resemblance is just superficial, e.g. Pearson's $\chi^2$ test - but let's not get distracted.) For prices, in particular, it is worth checking if the logarithm of the price fits Gaussian distribution better than the price itself.
  • Try to estimate the parameters of this distribution. For example, for Gaussian distribution you could try to calculate the mean and the standard deviation, by supplying the full sample or some subset of the sample which you trust. (If the sample is big enough and the mistakes are rare, there won't be too much difference between those two methods.) Excel, for example, has built-in functions to calculate means and standard deviations of samples.
  • Now for each wholesaler try to estimate the probability that their price would be that far from the mean as it happens to be.

For example, if the distribution is Gaussian, the mean is $m$ and the standard deviation is $\sigma$, as per the Wikipedia table, about $68\%$ of all data will fall between $m-\sigma$ and $m+\sigma$ and around $95\%$ of data will fall between $m-2\sigma$ and $m+2\sigma$. In other words, it is only $5\%$ probable that a price will fall outside of this interval by chance. With each additional $\sigma$, the probability falls rapidly.

Still, it is your decision what percentage to count as a "cut-off": if, say, you are happy to say you want to declare a mistake if a wholesaler is providing a price that seems so far from $m$ that the chance of this happening is only $1\%$, as per same table you would use the interval $(m-2.576\sigma, m+2.576\sigma)$. So ultimately the maths can only go that far, but it is still your decision what you will count as a probable mistake. Adding to that:

  • We only model the actual distribution of prices as one of known distributions (e.g. Gaussian). The distribution may be nothing like Gaussian. If it is common to have, say, $50\%$ discounts, then the prices will probably be clustered around two different prices - one half of the other.
  • The sample may be too small. One common way to have a better estimate of the mean and standard deviation is to up-front "remove the outliers". One common way to do that is to fix a percentage (say $5\%$), and then only take the $5$-percentile to $95$-percentile of the samples (i.e. order/sort the prices and then dismiss the lowest and highest $5\%$). This may be impossible with less than, say, $20$ wholesalers. Going further down, and the very estimates of the mean and standard deviation will also become inaccurate.
  • If a price is outside the range you have calculated, it doesn't mean it is necessarily a mistake. After all, there is that small probability (which you previously chose) that it did happen by chance. On the other hand, if a price is within the range, it does not mean it is not a mistake.

Altogether, this is not really exact science. Still it is certainly way better than poring over figures and saying "this one looks odd" - at least it lets you quantify how much odd it is.

The whole area where this fits in maths is "statistical hypothesis testing" - try to find some references and see if you can make sense of the bigger theory behind it.