Finding the Correct Sample Size N to calculate the $SEM$ in a Relative A/B Test with $X \sim B(p_1,n_1)$ and $Y \sim B(p_2,n_2)$


I asked a similar question on Cross Validated, but since I'm more interested in the theory behind choosing the right $N$, I decided to post it here as well.

For the test, I randomly assign users to two distinct groups and track their return rate to the website (the return rate on each day for each group is simply $\frac{\text{number of returning users}}{\text{total users on day zero}}$).

I have two groups, $X$ and $Y$, and I'm calculating the $SEM$ of the following relation:

$$ Z = \frac{X-Y}{Y} $$

where we can assume that $P(Y = 0) = 0$.

Since the experiment runs for 7 days, I'm comparing the two groups day by day.

To calculate the $SEM$, I first need the expected value and variance of $Z$, which I approximate with Taylor expansions:

  1. The expected value is approximated by

$$ E\left[\frac{X-Y}{Y}\right] \approx \frac{E[X-Y]}{E[Y]} - \frac{Cov[X-Y,Y]}{E[Y]^2} + \frac{E[X-Y]}{E[Y]^3}Var[Y] $$

  2. The variance is approximated by

$$ Var\left[\frac{X-Y}{Y}\right] \approx \frac{Var[X-Y]}{E[Y]^2} - \frac{2E[X-Y]}{E[Y]^3}Cov[X-Y,Y] + \frac{E[X-Y]^2}{E[Y]^4}Var[Y] $$
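
To make the two approximations above concrete, here is a minimal Python sketch. It rests on assumptions that are not stated in the question: $X$ and $Y$ are taken to be the observed return rates (proportions) of two independent groups, so that $Var[X] = p_1(1-p_1)/n_1$, $Var[Y] = p_2(1-p_2)/n_2$ and $Cov[X-Y,Y] = -Var[Y]$. The function name `ratio_moments` and the example numbers are purely illustrative.

```python
def ratio_moments(p1, n1, p2, n2):
    """Taylor approximations of E[Z] and Var[Z] for Z = (X - Y) / Y.

    Assumption (not from the question): X and Y are sample proportions
    of two independent binomial groups with parameters (n1, p1), (n2, p2).
    """
    # Moments of the two sample proportions.
    ex, ey = p1, p2
    var_x = p1 * (1 - p1) / n1
    var_y = p2 * (1 - p2) / n2

    # Moments of the numerator A = X - Y.
    ea = ex - ey                # E[X - Y]
    var_a = var_x + var_y       # Var[X - Y], using independence
    cov_ay = -var_y             # Cov[X - Y, Y] = Cov[X, Y] - Var[Y]

    # The two Taylor formulas, term by term.
    mean_z = ea / ey - cov_ay / ey**2 + ea * var_y / ey**3
    var_z = (var_a / ey**2
             - 2 * ea * cov_ay / ey**3
             + ea**2 * var_y / ey**4)
    return mean_z, var_z


# Hypothetical day: 30% vs 25% return rate, 1000 users per group.
mean_z, var_z = ratio_moments(0.30, 1000, 0.25, 1000)
print(f"E[Z] ~ {mean_z:.4f}, Var[Z] ~ {var_z:.5f}")
```

Under these assumptions the group sizes $n_1$ and $n_2$ enter the result only through $Var[X]$ and $Var[Y]$; if $X$ and $Y$ are instead treated as raw counts, the moments in the first few lines would need to change accordingly.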

Then I calculate the $SEM$ with

$$ {SEM_i} = \frac{\sigma_i}{\sqrt{N_i}} $$

where $i$ indexes the day of the experiment.

What is the right sample size ${N_i}$ to use in that case?

  1. Is it the number of users that returned on that day ($N_{x,i} + N_{y,i}$)?
  2. Is it $\max(N_{x,i}, N_{y,i})$?
  3. Is it the number of days I run the experiment (7)?

Again, here I'm trying to understand the theory behind choosing the right $N$.