I asked a similar question on Cross Validated, but since I'm more interested in the theory behind choosing the right $N$, I decided to post it here as well.
For the test, I randomly assign users to two distinct groups and track their return rate to the website (the return rate on each day for each group is simply $\frac{\text{number of returning users}}{\text{total users on day zero}}$).
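To make the definition concrete, here is a minimal Python sketch of the per-day return rate for one group. The function name `return_rates` and its arguments are hypothetical, chosen only for illustration.

```python
def return_rates(day_zero_users, returning_by_day):
    """Return rate per day for one group.

    day_zero_users: total users assigned to the group on day zero.
    returning_by_day: list with the number of returning users on each day.
    Both names are hypothetical and used only for illustration.
    """
    if day_zero_users <= 0:
        raise ValueError("day_zero_users must be positive")
    # Each day's rate uses the same day-zero denominator.
    return [returned / day_zero_users for returned in returning_by_day]
```

Calling it once per group gives the two daily series that are compared day by day.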
I have two groups, with return rates $X$ and $Y$, and I'm calculating the standard error of the mean ($SEM$) of the following relative difference:
$$ Z = \frac{X-Y}{Y} $$
where we can assume that the probability of $Y$ being zero is zero.
Since the experiment runs for 7 days, I'm comparing the two groups day by day.
To calculate the $SEM$, I first need the expected value and variance of $Z$, which I approximate using Taylor expansions:
- The expected value is approximated by
$$ E\left[\frac{X-Y}{Y}\right] \approx \frac{E[X-Y]}{E[Y]} - \frac{Cov[X-Y,Y]}{E[Y]^2} + \frac{E[X-Y]}{E[Y]^3}Var[Y] $$
- The variance is approximated by
$$ Var\left[\frac{X-Y}{Y}\right] \approx \frac{Var[X-Y]}{E[Y]^2} - \frac{2E[X-Y]}{E[Y]^3}Cov[X-Y,Y] + \frac{E[X-Y]^2}{E[Y]^4}Var[Y] $$
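The two expansions above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the inputs are the means, variances and covariance of $X$ and $Y$ (however they are estimated), and the covariance defaults to zero on the assumption that the two groups are independent. It uses the identities $Var[X-Y] = Var[X] + Var[Y] - 2Cov[X,Y]$ and $Cov[X-Y,Y] = Cov[X,Y] - Var[Y]$.

```python
def ratio_moments(mean_a, mean_b, var_a, var_b, cov_ab):
    """Taylor approximations of E[A/B] and Var[A/B] (hypothetical helper)."""
    mean = (mean_a / mean_b
            - cov_ab / mean_b**2
            + mean_a * var_b / mean_b**3)
    var = (var_a / mean_b**2
           - 2 * mean_a * cov_ab / mean_b**3
           + mean_a**2 * var_b / mean_b**4)
    return mean, var


def relative_difference_moments(mean_x, mean_y, var_x, var_y, cov_xy=0.0):
    """Approximate E[Z] and Var[Z] for Z = (X - Y) / Y.

    cov_xy defaults to 0.0, which assumes the two groups are independent.
    """
    # Numerator A = X - Y, denominator B = Y.
    mean_a = mean_x - mean_y
    var_a = var_x + var_y - 2 * cov_xy   # Var[X - Y]
    cov_ab = cov_xy - var_y              # Cov[X - Y, Y]
    return ratio_moments(mean_a, mean_y, var_a, var_y, cov_ab)
```

Since $Z = X/Y - 1$, the variance returned here should agree with the same approximation applied directly to $X/Y$, which is a quick sanity check on the algebra.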
Then I calculate the $SEM$ with
$$ {SEM_i} = \frac{\sigma_i}{\sqrt{N_i}} $$
where $i$ indexes the day.
What is the right sample size $N_i$ to use in this case?
- Is it the number of users that returned on that day ($N_{x,i}+N_{y,i}$)?
- Is it $\max(N_{x,i}, N_{y,i})$?
- Is it the number of days I run the experiment (7)?
Again, here I'm trying to understand the theory behind choosing the right $N$.