Suppose I have two methods, method A and method B. We evaluate the performance of each method in 100 experiments, each with either a positive (1) or a negative (0) outcome.
Method A achieves 84 positive outcomes and method B achieves 56 positive outcomes.
Is there a measure that tells us how confident we can be that A did not perform better merely by chance?
My idea would be the following:
Fit a binomial distribution to the results produced by A (ML estimate). Then calculate the likelihood of the results of B being produced by that distribution ... although I don't really know how that is done.
Are there any suggestions for how to arrive at a genuinely meaningful statement about whether A's advantage over B is not just due to chance?
Am I on the right track?
Yes, you are on the right general track. But I don't think you need to derive the test using ML; most texts present a test with a normal and/or chi-squared test statistic. I can't be sure about the details without knowing more about what you have been studying lately. The comment by @lulu suggests the two methods are not the same; I agree. Here are two (similar) formal tests.
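As an aside, the plug-in idea from the question can be computed directly. Here is a minimal Python sketch (note it treats $\hat p_A$ as a known constant, ignoring its own sampling error, which is one reason the formal two-sample tests below are preferable):

```python
from math import comb

n = 100
successes_a, successes_b = 84, 56

# ML estimate of the success probability from method A's results
p_hat = successes_a / n  # 0.84

# Exact probability, under Binomial(n, p_hat), of a result as low as
# method B's 56 successes (i.e. P(X <= 56) with X ~ Binomial(100, 0.84))
p_tail = sum(comb(n, k) * p_hat**k * (1 - p_hat)**(n - k)
             for k in range(successes_b + 1))

# p_tail is astronomically small: B's results are very unlikely
# to have come from the distribution fitted to A
print(p_tail)
```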
First, see test comparing 2 binomial proportions for the equations and theory. (Or look in your textbook.)
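In case the reference is not at hand: the pooled two-proportion $z$ statistic can be computed in a few lines of Python (a sketch using the counts from the question):

```python
from math import sqrt

n_a = n_b = 100
x_a, x_b = 84, 56

p_a, p_b = x_a / n_a, x_b / n_b
p_pool = (x_a + x_b) / (n_a + n_b)  # pooled estimate under H0: p_A = p_B

# Standard error of (p_a - p_b) under the null hypothesis
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_a - p_b) / se

print(round(z, 2))  # 4.32 -- far beyond 1.96, so reject H0 at the 5% level
```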
Second, here is Minitab output for your data, which indicates you should reject $H_0: p_A = p_B.$
From R statistical software, a somewhat similar chi-squared test is shown below, which also rejects the null hypothesis. (Your text may show this test instead of the one above, or in addition to it.)
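Since the R output is not easy to reproduce here, the statistic that R's `prop.test` reports can be sketched by hand in Python: it is the chi-squared statistic for the $2\times 2$ table with Yates' continuity correction (small differences in the last digits from the R output may come from the exact form of the correction used):

```python
# 2x2 table of observed counts: rows = methods A, B; columns = success, failure
observed = [[84, 16], [56, 44]]

row_totals = [sum(r) for r in observed]            # [100, 100]
col_totals = [sum(c) for c in zip(*observed)]      # [140, 60]
total = sum(row_totals)                            # 200

# Expected counts under H0: p_A = p_B
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Yates-corrected chi-squared statistic (df = 1)
chi2 = sum((abs(o - e) - 0.5) ** 2 / e
           for orow, erow in zip(observed, expected)
           for o, e in zip(orow, erow))

# about 17.36, far above the df=1 critical value 10.83 at the 0.1% level
print(round(chi2, 3))
```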
Note: $\sqrt{17.365} = 4.167,$ slightly smaller than the $z = 4.32$ from Minitab. I think the main difference is that R uses a 'continuity correction'.