I have a large dataset where many values are decimals (results of Euclidean distance calculations).
I was computing means over this dataset (which is in a spreadsheet) when I noticed the following: if I split the dataset into dataset1 and dataset2, then
mean(dataset) != mean(mean(dataset1),mean(dataset2))
I was using the AVERAGE function in Google Sheets, but the same thing happened when I transferred the data to Excel. My guess was that there were too many decimal places for Excel and Google Sheets to handle, but the problem persisted even after processing the dataset in Python with the values explicitly cast to floats.
Even when rounding the results to two decimal places: mean(dataset) = 23.76 != mean(mean(dataset1), mean(dataset2)) = 23.75
So my question, specifically: is there a common practice for overcoming this problem when averaging numbers with many decimal places? Here are links to the datasets:
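For reference, here is a minimal Python reproduction of the check I ran. The values below are made-up placeholders standing in for the real datasets:

```python
from statistics import mean

# Hypothetical placeholder values; the real datasets are linked above.
dataset1 = [23.7612, 23.7598, 23.7505]
dataset2 = [23.7401, 23.7633]
dataset = dataset1 + dataset2

print(mean(dataset))                            # mean over all values
print(mean([mean(dataset1), mean(dataset2)]))   # mean of the two sub-means
```

The two printed values differ even though every number is a plain Python float.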
If the two datasets have different sizes, then of course you should expect
mean(dataset) != mean(mean(dataset1),mean(dataset2))
For example:
mean(5,5,5,5,5) = 25/5 = 5
mean(0,0) = 0/2 = 0
mean(5,5,5,5,5,0,0) = 25/7 ≈ 3.57
which is not equal to
mean(5,0) = 2.5
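The remedy is to weight each sub-mean by the size of its dataset, which recovers the overall mean exactly. A quick sketch using the example above:

```python
from statistics import mean

dataset1 = [5, 5, 5, 5, 5]
dataset2 = [0, 0]

m1, m2 = mean(dataset1), mean(dataset2)
n1, n2 = len(dataset1), len(dataset2)

# Unweighted mean of the sub-means -- wrong when sizes differ:
print(mean([m1, m2]))                     # 2.5

# Size-weighted combination -- matches the overall mean:
weighted = (n1 * m1 + n2 * m2) / (n1 + n2)
print(weighted)                           # 3.571...
print(mean(dataset1 + dataset2))          # same value
```

This is just the definition of the mean applied in two stages: n1*m1 and n2*m2 restore the two partial sums before dividing by the total count.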
See also the closely related Simpson's paradox.