I performed a one-way ANOVA on a data set with 6 groups (called 101-106). Each group has between 6 and 8 observations, and all values are negative. I used Python for the task and got a p-value < 0.05, which tells me that the group means are not all equal. Now I would like to know which groups differ from which, so I ran Tukey's test (also in Python), which gave the following summary table:
     group1  group2  meandiff    lower    upper  reject
 0      101     102    0.2917  -0.0425   0.6259   False
 1      101     103    0.1571  -0.1649   0.4792   False
 2      101     104   -0.1333  -0.4675   0.2009   False
 3      101     105    0.0833  -0.2509   0.4175   False
 4      101     106   -0.0500  -0.3626   0.2626   False
 5      102     103   -0.1345  -0.4566   0.1875   False
 6      102     104   -0.4250  -0.7592  -0.0908   True
 7      102     105   -0.2083  -0.5425   0.1259   False
 8      102     106   -0.3417  -0.6543  -0.0290   True
 9      103     104   -0.2905  -0.6125   0.0316   False
10      103     105   -0.0738  -0.3959   0.2482   False
11      103     106   -0.2071  -0.5067   0.0924   False
12      104     105    0.2167  -0.1175   0.5509   False
13      104     106    0.0833  -0.2293   0.3960   False
14      105     106   -0.1333  -0.4460   0.1793   False
If the reject column says True, we reject the null hypothesis and the means are NOT equal; if it says False, we accept the null hypothesis and the means are equal. As you can see, the result is a bit weird. For example, group 101 is not different from any of the other groups, which cannot be true, since it must be different from at least one group according to the ANOVA result. Also, groups 102 and 104 are different from each other, but both are similar to group 103, which does not make any sense. Am I missing something?
I used this method (and syntax) in the past and it worked fine.
Groups:
101: -1.45, -1.35, -1.6, -1.6, -1.65, -1.65
102: -1.5, -1.4, -1.15, -1.1, -1.25, -1.15
103: -1.5, -1.6, -1.525, -1.125, -1.2, -1.5, -1.3
104: -1.9, -1.55, -1.55, -1.7, -1.95, -1.45
105: -1.55, -1.65, -1.5, -1.3, -1.3, -1.5
106: -2, -1.4, -1.8, -1.75, -1.15, -1.7, -1.45, -1.55
Thanks for the data, which I entered into Minitab. I thought it would be a good idea to compare the output of what ought to be a standard procedure across the two statistical packages.
The largest differences in means are between G2 and G4 and between G2 and G6 (your groups 102, 104 and 106). Because the F-test is significant at the 1% level (P-value 0.006), it is reasonable to say that at least the largest difference (between G2 and G4) must be statistically significant. It is not fair to make many direct comparisons among the 95% CIs in the output and figure above, because error probabilities may proliferate when many comparisons are made. However, in view of the information above, it should not be surprising that Python's Tukey procedure chooses the two largest differences as significant.
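As a rough check on the statement about the largest differences, here is a minimal sketch that uses only the Python standard library to compute the group means, the one-way ANOVA F statistic, and the pairwise absolute differences from the data in the question. The variable names are my own, and it does not compute the P-value, which needs an F distribution (for example from SciPy).

```python
from itertools import combinations
from statistics import mean

# Data copied from the question (groups 101-106).
groups = {
    101: [-1.45, -1.35, -1.6, -1.6, -1.65, -1.65],
    102: [-1.5, -1.4, -1.15, -1.1, -1.25, -1.15],
    103: [-1.5, -1.6, -1.525, -1.125, -1.2, -1.5, -1.3],
    104: [-1.9, -1.55, -1.55, -1.7, -1.95, -1.45],
    105: [-1.55, -1.65, -1.5, -1.3, -1.3, -1.5],
    106: [-2, -1.4, -1.8, -1.75, -1.15, -1.7, -1.45, -1.55],
}

means = {g: mean(v) for g, v in groups.items()}
n_total = sum(len(v) for v in groups.values())
grand_mean = sum(sum(v) for v in groups.values()) / n_total

# One-way ANOVA: between-group and within-group sums of squares.
ss_between = sum(len(v) * (means[g] - grand_mean) ** 2 for g, v in groups.items())
ss_within = sum((x - means[g]) ** 2 for g, v in groups.items() for x in v)
df_between = len(groups) - 1
df_within = n_total - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

# Rank all 15 pairs by the absolute difference of their sample means.
abs_diffs = sorted(
    ((abs(means[a] - means[b]), a, b) for a, b in combinations(groups, 2)),
    reverse=True,
)
largest_pairs = [(a, b) for _, a, b in abs_diffs[:2]]

for g, m in means.items():
    print(f"group {g}: n = {len(groups[g])}, mean = {m:.4f}")
print(f"F({df_between}, {df_within}) = {f_stat:.3f}")
print("two largest absolute differences:", largest_pairs)
```

The two pairs at the top of the ranking are the ones that both packages flag as significant.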
Below is Minitab's version of the Tukey procedure.
Thus we see that the Tukey procedure chooses exactly the two largest differences among pairs of group sample means as significant with a 'family' significance level of 5%. The Python and Minitab outputs are not in conflict.
I think it is possible that your confusion may have arisen from looking at differences among means instead of the group means themselves. Furthermore, if you are going to look at differences among means, you have to consider absolute rather than signed differences. The biggest absolute differences are declared significant by both Python and Minitab.
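To make the point about absolute differences concrete, the sketch below takes the rows of the Tukey table exactly as posted in the question, ranks them by |meandiff|, and checks that the rejected pairs are the two at the top. It also checks that a pair is rejected exactly when its interval excludes zero. The tuple layout and variable names are only for illustration.

```python
# Rows copied from the Tukey table in the question:
# (group1, group2, meandiff, lower, upper, reject)
tukey_rows = [
    (101, 102, 0.2917, -0.0425, 0.6259, False),
    (101, 103, 0.1571, -0.1649, 0.4792, False),
    (101, 104, -0.1333, -0.4675, 0.2009, False),
    (101, 105, 0.0833, -0.2509, 0.4175, False),
    (101, 106, -0.0500, -0.3626, 0.2626, False),
    (102, 103, -0.1345, -0.4566, 0.1875, False),
    (102, 104, -0.4250, -0.7592, -0.0908, True),
    (102, 105, -0.2083, -0.5425, 0.1259, False),
    (102, 106, -0.3417, -0.6543, -0.0290, True),
    (103, 104, -0.2905, -0.6125, 0.0316, False),
    (103, 105, -0.0738, -0.3959, 0.2482, False),
    (103, 106, -0.2071, -0.5067, 0.0924, False),
    (104, 105, 0.2167, -0.1175, 0.5509, False),
    (104, 106, 0.0833, -0.2293, 0.3960, False),
    (105, 106, -0.1333, -0.4460, 0.1793, False),
]

# Rank the pairs by absolute (not signed) mean difference.
ranked = sorted(tukey_rows, key=lambda row: abs(row[2]), reverse=True)
top_two = [(g1, g2) for g1, g2, _, _, _, _ in ranked[:2]]

# Pairs the table marks as significantly different.
rejected = [(g1, g2) for g1, g2, _, _, _, reject in tukey_rows if reject]

# A pair is rejected exactly when its interval does not contain zero.
consistent = all(
    reject == (lower > 0 or upper < 0)
    for _, _, _, lower, upper, reject in tukey_rows
)

for g1, g2, diff, lower, upper, reject in ranked:
    print(f"{g1}-{g2}: |diff| = {abs(diff):.4f}, CI = ({lower}, {upper}), reject = {reject}")
print("rejected pairs:", rejected)
print("top two by |diff|:", top_two)
```

Note that 101-102 and 103-104 come next in the ranking with intervals that only just include zero, which is why they are not declared significant.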