A/B Test: t-test vs. chi-squared test


I am working on a marketing campaign analysis comparing treatment A and treatment B. The goal is to find which treatment yields more clicks. Say the experiment runs for one month, and at the end I want to determine whether one treatment is statistically better than the other. I can formulate the problem in two ways:

  1. t-test: I find the average daily click rates of A and B, and then perform a t-test.

  2. chi-squared test: I find the total clicks of the two treatments at the end, and perform a chi-squared test.

Can someone comment on the differences of these two approaches?

There are 2 best solutions below

  • I'll start with the chi-squared test. You said you want to compare the total clicks at the end. But wouldn't this depend on how many people visit the sites that were using treatment A versus treatment B? If treatment A only had 250 visitors and all of them clicked, it would probably be better than if treatment B had 1000 visitors and 500 clicked. So you might be interested in comparing the proportion of clicks. Say you have the following data after one month:
clicked   A    B
yes       500  500
no        600  610

Would you say that A is the same as B? Probably not on the raw totals alone: more people visited the site under treatment B, so it had a lower click rate despite the equal click counts. At this point, you can do a 2-proportion test or a chi-squared test to see whether there is a difference between treatment A and treatment B in getting people to click; they are the same test. Here is how you would do it as a 2-proportion test.

Hand-calculating the z test statistic

AY=500   # treatment A: clicked yes
BY=500   # treatment B: clicked yes
AN=600   # treatment A: clicked no
BN=610   # treatment B: clicked no

#prop test calculation
p1=AY/(AY+AN)                 # click proportion under A
p2=BY/(BY+BN)                 # click proportion under B
phat=p1-p2                    # difference in proportions
ppooled=((AY+AN)*p1+(BY+BN)*p2)/(AY+AN+BY+BN)  # pooled proportion
zstatistic=phat/sqrt(ppooled*(1-ppooled)*(1/(AY+AN)+1/(BY+BN)))
zstatistic^2
pnorm(zstatistic, lower.tail=FALSE)*2   # two-sided p-value (valid here since zstatistic > 0)

Output:

> zstatistic^2
[1] 0.03739652
> pnorm(zstatistic, lower.tail=FALSE)*2
[1] 0.84666

So we get a p-value of about 0.85, giving no evidence of a difference between the two proportions. (This is not too surprising, because I constructed the proportions to be quite close together.) The square of the z-statistic we just computed is actually the chi-squared statistic, as we will see below.

Using the prop.test command

We can also use the built-in prop.test command directly.

prop.test(c(AY, BY), c(AY+AN, BY+BN), correct=FALSE)

Output:

    2-sample test for equality of proportions without continuity correction

data:  c(AY, BY) out of c(AY + AN, BY + BN)
X-squared = 0.037397, df = 1, p-value = 0.8467
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.03740849  0.04559850
sample estimates:
   prop 1    prop 2 
0.4545455 0.4504505 

From the X-squared line of the output, you can see that the chi-squared test statistic (which is the square of the z-statistic) and the p-value both match what we computed previously.

Chi-squared by hand

Now here is how to do the chi-squared test by hand. Note that it is exactly the same test as the one we just did. Its effect is to compare the observed data to what we would expect if there were no difference between treatments. That is, it compares the first table with the following table, called the expected table.

Expected
clicked  A          B
yes      497.7376   502.2624
no       602.2624   607.7376

The code is as follows

#chi-squared test calculation
tab=data.frame(clicked=c("yes", "no"), A=c(AY, AN), B=c(BY, BN))
rate=(AY+BY)/(AY+BY+AN+BN)
AYE=rate*(AY+AN)
ANE=AY+AN-AYE
BYE=rate*(BY+BN)
BNE=BY+BN-BYE
chisqstatistic=(AYE-AY)^2/AYE+(ANE-AN)^2/ANE+(BYE-BY)^2/BYE+(BNE-BN)^2/BNE
chisqstatistic
df=1
pchisq(chisqstatistic, df, lower.tail=FALSE)

Output:

> pchisq(chisqstatistic, df, lower.tail=FALSE)
[1] 0.84666
> chisqstatistic
[1] 0.03739652

Chi-squared test command

Or, using the built-in command, we get

> test=chisq.test(tab[,2:3], correct=FALSE)
> test

    Pearson's Chi-squared test

data:  tab[, 2:3]
X-squared = 0.037397, df = 1, p-value = 0.8467

> pchisq(test$statistic, df, lower.tail=FALSE)
X-squared 
  0.84666 
  • The difference between the preceding chi-squared test and the prop test is that for the prop test, you can specify a one-sided alternative hypothesis. So you could test whether the proportion for B is greater than that for A, or vice versa.
  • You also suggest conducting a t-test on average daily click rates, by which I presume you mean summing the daily click rates and dividing by 30 to get average clicks per day. Again, I'm not sure this is a good idea, because the sites under treatment A and treatment B may not receive the same traffic and so may not be directly comparable. However, if we assume they receive exactly the same traffic, you could conceivably conduct a t-test on the difference in means of any measurable quantity, such as clicks per day: you would have a sample of clicks per day under each treatment and compare the means of those samples. This works, again, only if treatment A and treatment B experience the same traffic.
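Both points above can be sketched in code. The answer's examples are in R; here is a rough Python equivalent using scipy, with made-up daily click rates for the t-test part:

```python
from math import sqrt
from scipy import stats

# One-sided two-proportion z-test: H1 is that B's click rate exceeds A's.
# (R's prop.test does the same with alternative="greater".)
AY, AN = 500, 600   # treatment A: clicked yes / no
BY, BN = 500, 610   # treatment B: clicked yes / no
p1 = AY / (AY + AN)
p2 = BY / (BY + BN)
pp = (AY + BY) / (AY + AN + BY + BN)    # pooled click proportion
z = (p2 - p1) / sqrt(pp * (1 - pp) * (1 / (AY + AN) + 1 / (BY + BN)))
p_one_sided = stats.norm.sf(z)           # P(Z > z) under the null

# t-test on daily click rates, assuming equal traffic per day.
# These daily rates are invented purely for illustration.
daily_a = [0.44, 0.46, 0.45, 0.47, 0.43, 0.45, 0.46]
daily_b = [0.45, 0.44, 0.46, 0.45, 0.44, 0.46, 0.45]
t_stat, p_two_sided = stats.ttest_ind(daily_a, daily_b)
```

Note that `z * z` reproduces the X-squared value 0.0374 from the answer, since the same pooled proportion is used.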

Conclusion

The two methods you propose are fundamentally different. One tests whether there is an overall difference in the proportion of clicks garnered by treatment A versus treatment B, and it can be conducted even when the number of visits differs between the two treatments. The other requires that the number of visits under treatment A and treatment B be the same, and then compares the clicks per day of the two treatments.


I'm by no means a statistician, so what follows is just some common-sense advice.

The Student t-test is the canonical way to compare means, but it relies on having several samples from each of the two distributions. If you have a long streak of clicks, you are welcome to split it into several subsets, but you need to be reasonably sure that the distribution is the same, and approximately normal, within each subset.

Splitting by day is very questionable, because it is hard to justify that people's clicking habits do not change from day to day. Speaking for myself: when I come home after a work day and go to the web, I definitely do not want any stupid ads, but on weekends I may be more benevolent if I'm in a really good mood.

The Student test is robust in the sense that if you do any splitting (decided upon in advance, of course, without looking at the data) and it shows statistical significance, then the effect is indeed there. But if you split into subgroups with different underlying distributions, your empirical variance may noticeably overestimate the true variance and you lose sensitivity: if the day-to-day change in mean click rate is comparable to or greater than the difference in overall means between design A and design B, then with daily averages you will detect nothing even if the difference is there. Approximate normality is usually not a problem as long as each subset is not too small.
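The sensitivity-loss effect described above can be illustrated with a quick simulation (a sketch with invented parameters, not from the answer): a drift shared by both arms leaves the difference in means alone but inflates the per-sample standard deviation that the unpaired t-test divides by.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
days = 30
true_gap = 0.01                                # real A-vs-B difference (invented)
day_drift = rng.normal(0.0, 0.05, size=days)   # shared day-to-day mood swings
noise_a = rng.normal(0.0, 0.01, size=days)
noise_b = rng.normal(0.0, 0.01, size=days)

# Stable world: no day effect.
a_stable = 0.45 + noise_a
b_stable = 0.45 + true_gap + noise_b

# Drifting world: identical data plus the same drift added to both arms.
a_drift = a_stable + day_drift
b_drift = b_stable + day_drift

sd_stable = a_stable.std(ddof=1)
sd_drift = a_drift.std(ddof=1)                 # much larger than sd_stable

p_stable = stats.ttest_ind(a_stable, b_stable).pvalue
p_drift = stats.ttest_ind(a_drift, b_drift).pvalue
```

With the drift added, the empirical standard deviation balloons, so the same underlying gap is typically much harder to detect.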

So, if you have a timestamp on, say, each user's page visit, I would definitely go with Student, but I'd split differently: I would partition the whole observation period into 10-minute intervals and, within each interval, assign 1 minute to each group (I would even rotate or randomize that assignment from one 10-minute interval to another). That gives more groups for estimating the variance, and I would be far more confident that the underlying distributions of the groups are the same. By the way, I would also compare the daily means in this way, and if a significant difference between days does exist, I would request that the observations be extended by the two missing days of the week before doing any analysis, because the natural human life cycle is a full week, not 5 days (just make sure you don't hit holidays, etc.)
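A minimal sketch of that assignment scheme (the function name and the particular rotation rule are my own choices, not the answer's):

```python
def assign_minutes(n_intervals):
    """For each 10-minute interval, pick one minute for design A and one
    for design B, rotating the positions so that no minute-of-interval is
    systematically favoured by either design."""
    plan = []
    for i in range(n_intervals):
        minute_a = i % 10          # A's minute walks through positions 0..9
        minute_b = (i + 5) % 10    # B's minute stays 5 positions away
        plan.append((i, minute_a, minute_b))
    return plan

# Each triple is (interval index, minute for A, minute for B);
# the remaining 8 minutes of the interval go unassigned.
plan = assign_minutes(12)
```

The per-minute click rates collected this way then feed straight into a two-sample t-test, with many more groups available for the variance estimate.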

What if you don't have the timestamps, only the daily totals? Then, as I said, you may still run the Student test; if it shows a difference, the difference is there, but if it doesn't, you may just be suffering from the effect described above. So in this case I would go for a quick-and-dirty version of $\chi^2$.

Note that the original Pearson $\chi^2$ test is designed for a different situation than the one you are in (just look up the assumptions and the setup on the Wikipedia page or in a textbook), so I doubt you can run it the way you intend and get a meaningful result. However, we can derive something similar from first principles. Suppose that $A,B$ are the total numbers of users reached and $X,Y$ are the corresponding numbers of users who clicked. You want to know whether the difference occurred by pure chance. If you accept that all users and design effects are the same and there is some underlying probability $p$ of clicking, then the distribution of the vector $\left(\frac{X-Ap}{\sqrt{Ap(1-p)}},\frac{Y-Bp}{\sqrt{Bp(1-p)}}\right)$ should be approximately standard 2D normal and, in particular, $$ \frac{(X-Ap)^2}{Ap(1-p)}+\frac{(Y-Bp)^2}{Bp(1-p)} $$
would have the $\chi^2$ distribution with 2 degrees of freedom. Since you don't know $p$, you have to take the infimum over it, so your actual statistic is $$ \inf_{p\in(0,1)}\left[\frac{(X-Ap)^2}{Ap(1-p)}+\frac{(Y-Bp)^2}{Bp(1-p)}\right] $$
This is not exactly $\chi^2$, but it is concentrated more tightly than $\chi^2$, so you can still use the $\chi^2$ table for it. When $A,B$ are not too small and the probabilities of clicking are not too close to $0$ or $1$, this approximation is fairly good. It is robust too (in the above sense), and it will also suffer a loss of sensitivity if the probability of clicking depends significantly on the day or on the user group, but it is still better than Student with a bad partition (though worse than Student with a good one).
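Here is a sketch of that statistic in code (a Python implementation of the formula above, on the first answer's totals; variable names are mine). It minimizes the expression over $p$ numerically and reads off a p-value from the $\chi^2$ table with 2 degrees of freedom, per the approximation described:

```python
from scipy import optimize, stats

A, B = 1100, 1110   # total users reached under each design
X, Y = 500, 500     # users who clicked

def stat(p):
    """The chi-squared-like statistic for a candidate click probability p."""
    return ((X - A * p) ** 2 / (A * p * (1 - p))
            + (Y - B * p) ** 2 / (B * p * (1 - p)))

# Take the infimum over p in (0, 1) numerically.
res = optimize.minimize_scalar(stat, bounds=(1e-6, 1 - 1e-6), method="bounded")
inf_stat = res.fun

# Compare against the chi-squared table with 2 degrees of freedom.
p_value = stats.chi2.sf(inf_stat, df=2)
```

On these numbers the minimizing $p$ lands essentially at the pooled proportion, so `inf_stat` comes out very close to the X-squared value 0.0374 from the first answer; what differs is the reference distribution (2 degrees of freedom here versus 1 there), which is what makes this the rougher check described above.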

Just my two cents. Comments from people who know better are invited :-) (Questions from people who know worse too).