I have two algorithms A and B which have to compute a solution to some problem. Each solution is given some objective value which indicates the quality of the solution. I need to perform a Wilcoxon Signed-Rank Test to test whether there is any evidence that these two algorithms perform statistically significantly different from one another.
I have performed 12 trials of each algorithm and tabulated the objective values from solutions found during each trial. A smaller objective value is better.
A B
878 890
872 888
865 879
877 874
872 870
890 886
873 871
887 879
868 873
888 882
878 881
I am confused about a few details of performing this test.
Should I do a one-tailed or two-tailed test?
I'm not sure what my null hypothesis is. What should it be, given that I want to find out whether algorithms $A$ and $B$ perform significantly differently from one another?
If the $p$-value is $> 0.05$, what does this mean?
If the $p$-value is $< 0.05$, what does this mean?
For the Wilcoxon signed-rank test, the natural null hypothesis is that the paired differences are symmetrically distributed about zero (loosely, that the two algorithms perform the same), and the alternative hypothesis is that they are not. Under that alternative you regard $B$ being higher than $A$ as just as extreme a result as $A$ being higher than $B$, so you want a two-tailed test.
The $p$-value is the probability that, if the null hypothesis is true, you see differences as extreme as or more extreme than the ones you actually did see. $p < 0.05$ means that probability is less than $0.05$: if the null hypothesis is true, you would expect on average to see such results fewer than one in twenty times. $p > 0.05$ is the opposite: if the null hypothesis is true, you would expect on average to see such results more than one in twenty times. So the former is a stronger indication from your observations that the null hypothesis may not be correct.
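As a concrete sketch (assuming Python with SciPy is available, and using the eleven pairs shown in the table above), a two-sided Wilcoxon signed-rank test can be run like this:

```python
# Sketch: two-sided Wilcoxon signed-rank test on the paired objective
# values tabulated in the question, using scipy.stats.wilcoxon.
from scipy.stats import wilcoxon

a = [878, 872, 865, 877, 872, 890, 873, 887, 868, 888, 878]
b = [890, 888, 879, 874, 870, 886, 871, 879, 873, 882, 881]

# Null hypothesis: the paired differences B - A are symmetric about zero.
# alternative="two-sided" treats B > A and A > B as equally extreme.
res = wilcoxon(a, b, alternative="two-sided")
print("statistic:", res.statistic)
print("p-value:", res.pvalue)

if res.pvalue < 0.05:
    print("Some evidence the algorithms perform differently")
else:
    print("No evidence of a difference at the 5% level")
```

Note that `wilcoxon` takes the two paired samples directly (it computes the differences internally), and that with tied absolute differences SciPy falls back to a normal approximation for the $p$-value.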