Given the results of an A/B test, which variation should be preferred in further tests?


It’s my first question here, so please pardon me if I don’t formulate everything correctly.

Let’s imagine we are running a time-independent test of 2 variations that each have some probability of achieving a goal (e.g. a conversion on a website).

Now let’s say we have done 10,000 tries and each variation got 5,000 of those tries/users. One variation is leading, but we don’t know which one is the winner with enough statistical significance. E.g. A has a 1.5% conversion rate and B a 1% conversion rate.

Is it still better to give both variations an equal number of tests, or is it likely that more tests for A or B would lead us to statistical significance faster?
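As a side note, the significance of the counts observed so far can be checked with a pooled two-proportion z-test. This is just an illustrative sketch (the function name is my own), using only the Python standard library; the counts 75/5000 and 50/5000 correspond to the 1.5% and 1% rates above:

```python
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of the pooled two-proportion z-test for
    H0: both variations have the same conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = abs(p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(z))

# 1.5% of 5000 tries vs 1% of 5000 tries, as in the question
p_value = two_proportion_p_value(75, 5000, 50, 5000)
print(round(p_value, 4))
```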


Best answer (score 11):

It depends on the variance of the population. If it turns out that conversions from A have a higher variance than conversions from B, then to achieve equal bounds on each estimate, you'll need more tests of the former than the latter.

It seems that in proportion testing people often assume the maximum variance, $p(1-p) = 0.25$ at $p = 0.5$, when establishing the number of tests to perform (see the Wikipedia entry). But in general, how "long" it takes to converge on a mean value will depend on the population variance.
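For illustration, the variance-proportional idea is known as Neyman allocation: split a fixed budget of tests between the groups in proportion to their standard deviations. A minimal sketch (the function name is my own, and the rates are the ones from the question):

```python
import math

def neyman_allocation(total_n, sd_a, sd_b):
    """Split total_n between two groups in proportion to their
    standard deviations (Neyman allocation), which minimises the
    variance of the estimated difference in means."""
    n_a = round(total_n * sd_a / (sd_a + sd_b))
    return n_a, total_n - n_a

# Bernoulli standard deviations sqrt(p * (1 - p)) for the question's rates
sd_a = math.sqrt(0.015 * 0.985)  # variation A, 1.5% conversion
sd_b = math.sqrt(0.010 * 0.990)  # variation B, 1.0% conversion
n_a, n_b = neyman_allocation(10_000, sd_a, sd_b)
print(n_a, n_b)
```

With these rates the split is only slightly unequal: A, having the higher variance, gets a bit more than half of the 10,000 tries.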

Answer (score 2):

This is really about power calculations, finding the sample sizes that maximise the probability that the test rejects the null hypothesis that the two conversion rates are equal, when in fact a specific alternative hypothesis is true.

It is sensitive to a large number of factors, including the precise test used, the statistical significance criterion, and the population rates, as well as the underlying distribution and its parameters (here a binomial distribution, so discrete). There is also the consideration that you would be doing the test precisely because you are not certain whether there is a difference, and so would have no justification for making the sample sizes different unless you had to.

The chart below gives my calculations for the power for different sample sizes for $B$ and the rest of the $10000$ for $A$, using a chi-squared test (no continuity correction) with significance $0.95$ if the true population probability of conversion is $0.015$ for $A$ and $0.01$ for $B$.

It suggests that the power in this case is maximised when the sample size for $A$ is slightly less than half, possibly because this gives $B$ more chance of producing some conversions close to the population proportion; this is not a simple variance argument, as the Bernoulli variance $p(1-p)$ is higher for $A$. It also suggests some local volatility: the power for $4572$ is about $0.6346$ while for $4573$ it is noticeably smaller at about $0.6136$, but this is a consequence of the precise values of $0.95$, $0.015$ and $0.01$, and this step down could disappear if they were marginally different.

I think the message from this is that, given the choice, make the two sample sizes broadly equal. If for some reason that is not possible, then consider oversampling the less common scenario so as to make the effective sample sizes less unequal, in order to increase the power.
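As a rough check of this conclusion, the power curve can be approximated with the normal-approximation two-proportion z-test, which is asymptotically equivalent to the chi-squared test without continuity correction. This is a sketch under that approximation, not the answer's exact computation:

```python
from statistics import NormalDist

def power_two_proportions(n_a, n_b, p_a, p_b, alpha=0.05):
    """Approximate power of the two-sided two-proportion z-test
    (asymptotically equivalent to a chi-squared test without
    continuity correction) for true rates p_a and p_b."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    p_pool = (n_a * p_a + n_b * p_b) / (n_a + n_b)
    se0 = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5  # SE under H0
    se1 = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5  # SE under H1
    delta = abs(p_a - p_b)
    # Probability that |z| exceeds the critical value under the alternative
    return (nd.cdf((delta - z_crit * se0) / se1)
            + nd.cdf((-delta - z_crit * se0) / se1))

total = 10_000
# Scan coarse splits of the 10,000 users between A and B
splits = range(500, total - 499, 50)
best_b = max(splits, key=lambda n_b: power_two_proportions(total - n_b, n_b, 0.015, 0.01))
p_equal = power_two_proportions(5_000, 5_000, 0.015, 0.01)
print(best_b, round(p_equal, 3))
```

With the question's numbers, the power at an equal split comes out at roughly $0.6$ under this approximation, in the same ballpark as the answer's chi-squared figures, and the curve is quite flat around the 50/50 split, which supports the "broadly equal" advice.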

[Chart: power of the chi-squared test as a function of the sample size allocated to $B$, with the remainder of the $10000$ allocated to $A$.]