Find errored member in a dataset


Assume a simple dataset that looks like this:

[dataset table omitted: 10 entities, each with a 'downloaded' and an 'uploaded' column]

As you can see, there are 10 rows, each one representing an entity. For each entity there is an amount of downloaded and an amount of uploaded data; in this example they are rounded values, but they could be floating-point values as well.

The sum of all values is computed for each column. The rule is that when we subtract (downloaded total) - (uploaded total), the result should always be zero. As you can notice, the result of this subtraction for this dataset is +25.
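For concreteness, the rule check itself is easy to express. The values below are made up (the original table is only an image), chosen so that the imbalance comes out to the +25 described above:

```python
# Hypothetical stand-in for the dataset image; each tuple is
# (downloaded, uploaded) for one entity.
dataset = [
    (100, 100), (120, 100), (50, 100), (80, 80), (105, 100),
    (150, 100), (130, 100), (125, 100), (45, 100), (90, 90),
]

downloaded_total = sum(d for d, u in dataset)
uploaded_total = sum(u for d, u in dataset)

# The rule: this should be 0 if no member is errored.
imbalance = downloaded_total - uploaded_total
print(imbalance)  # +25 for these made-up values
```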

From there, and assuming any other dataset following this pattern, how could I identify the errored member, if one exists (meaning the rule is not fulfilled)? If it isn't possible to be 100% sure of the errored member in a dataset, what is the best way to calculate, for each member, the probability of it being the errored one?

I hope I was clear with my request.

Best regards

William

NOTE: I am from Stack Overflow, but since it is a pure math question I came here. I am not asking for an implementation, but for advice and approaches for solving a problem like this.

EDIT: Here is some information I forgot to provide about the dataset.

  1. The number of entities is not limited and the size is not predictable; it could be small or large (1 to 10,000 approximately).
  2. In reality, the repeated 'ties' in the dataset shouldn't appear so often; they are just there to keep the example readable and simple.
  3. Here I use rounded values in the dataset, but we should expect floating-point values most of the time (99.99%).
  4. By an 'errored member' I mean a value that doesn't help fulfill the rule. Also, there could be 0 or more errored members.

1 Answer


You have not given anywhere near enough information for me to give you a responsible answer. So I will make some very speculative assumptions, and give you an answer based on those assumptions. If you are not satisfied with the answer, maybe that will prompt you to provide more insight into what your data are like and what you really want to do.

I assume that you always have a table of 10 or more rows ('entities') and that both upload and download numbers come from processes that are at least approximately normal. Then you have $n = 10$ or more row (download minus upload) differences $D_i,$ which should be approximately normally distributed. You are asking whether their average $\bar D$ is consistent with a true (population) average difference $\Delta = 0.$

In that case you have a paired t-test. The $T$ statistic is $$T = \frac{\bar D - 0}{S_D/\sqrt{n}},$$ where $T$ has Student's t distribution with degrees of freedom $df = n - 1.$

In your example, the $n = 10$ differences are 0, 20, -50, 0, 5, 50, 30, 25, -55, and 0. Thus their mean is $\bar D = 2.5,$ their standard deviation is $S_D = 33.19,$ and $T = 0.2382.$ You would have found a statistically significant difference if $|T| > 2.262,$ so your result is nowhere near significant. If you know about P-values, the P-value for your data is 0.817; it would have to be smaller than 0.05 in order to declare a significant difference at the 5% level of significance.
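If you want to reproduce these numbers, here is a minimal computation using only Python's standard library (the differences are the ten listed above):

```python
import math
from statistics import mean, stdev

# The ten row differences (download minus upload) from the example.
D = [0, 20, -50, 0, 5, 50, 30, 25, -55, 0]

n = len(D)
d_bar = mean(D)    # 2.5
s_d = stdev(D)     # sample standard deviation, about 33.19
T = (d_bar - 0) / (s_d / math.sqrt(n))   # about 0.2382

# Two-sided 5% critical value for Student's t with df = 9 is 2.262,
# so |T| here is nowhere near significant.
print(round(T, 4))
```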

Even with only ten pairs, the t test is reasonably tolerant of non-normal data as long as there is no marked skewness and there are no far outliers. For your data, a Shapiro-Wilk test of normality does not come anywhere near declaring your 10 differences to be other than normal. (But to be fair, it is very difficult to detect nonnormality for such small samples.)

If you are suspicious that, generally speaking (not just in this case), your data might be far from normal, then you could use a nonparametric Wilcoxon signed-rank test (see Wikipedia or a basic statistics text) to judge whether the median of your $n$ differences is significantly different from 0. This test does not assume normal data.
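As a rough illustration, the signed-rank statistic itself can be computed by hand. This sketch computes only the statistic (not a p-value), dropping zero differences and using midranks for tied absolute values, which is the usual convention:

```python
# Wilcoxon signed-rank statistic for the example differences.
D = [0, 20, -50, 0, 5, 50, 30, 25, -55, 0]

nonzero = [d for d in D if d != 0]          # zeros are dropped
abs_sorted = sorted(abs(d) for d in nonzero)

def midrank(value):
    # Average 1-based rank of all positions holding this absolute value.
    positions = [i + 1 for i, v in enumerate(abs_sorted) if v == value]
    return sum(positions) / len(positions)

w_plus = sum(midrank(abs(d)) for d in nonzero if d > 0)
w_minus = sum(midrank(abs(d)) for d in nonzero if d < 0)

# A large gap between the two rank sums would suggest a nonzero median.
print(w_plus, w_minus)
```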

For your particular data, this test does not work well because you have so many 'ties' (repeated values) among the $D_i.$ I suspect this may be a peculiarity of your particular data sample.

In case ties are an endemic difficulty, there are other possible paired tests available.

Here are some of the many possible difficulties in trying to make sense of your situation with so little information:

(1) When one fails to find significance, the next issue is to assess the power of the test: the probability that one would have rejected the null hypothesis if a difference of 'important' size were actually present. For example, with only 10 differences, it might be that you would almost never discover errors of a size that really would be of concern to you. From what you have said, I have no basis at all to comment on this possible difficulty.
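One way to get a feel for power is simulation. The sketch below assumes normal differences with a standard deviation near the sample value above and an illustrative 'important' shift of 30 units; both numbers are guesses for demonstration, not anything derived from the question:

```python
import math
import random
from statistics import mean, stdev

# Monte-carlo estimate of the paired t-test's power for n = 10 differences,
# assuming (illustratively) normal differences with sd = 33 and true shift 30.
random.seed(1)

def t_stat(sample):
    n = len(sample)
    return mean(sample) / (stdev(sample) / math.sqrt(n))

CRIT = 2.262   # two-sided 5% critical value for df = 9
n, sd, shift, trials = 10, 33.0, 30.0, 5000

rejections = sum(
    abs(t_stat([random.gauss(shift, sd) for _ in range(n)])) > CRIT
    for _ in range(trials)
)
power = rejections / trials

# Power is well below 1 here, so a shift of this size can easily be missed
# with only 10 rows.
print(power)
```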

(2) The numbers in your example are 'outrageously' round: not just integers, but all ending in 0 or 5. You said real data might not even be integers, but might have several decimal places. I would feel better about using the t test if that were true.

(3) It would also be better if you typically had more than 10 differences $D_i$ to use in deciding whether your overall table reveals an error. And it would be helpful if I didn't have to guess what you mean by an 'error'.