Following on from this comment on an answer to my previous question, I'd like to know two things:
- what is the best statistical test I can use to measure significance in the experiments I'm running? (previously it was stated that I could potentially use z-tests or Fisher's Exact test)
- how can I measure the required cohort size needed for each experiment to achieve reasonable power?
Here's some information on the experiments I'm running -- happy to provide more if needed:
- Each experiment will have an A cohort (the control) and a B cohort (which will see the treatment).
- Most experiments will only be run on cohorts of 30-200 participants.
- I'm only looking for B's that have a positive increase over A (one-sided).
- I also expect that if there is a positive increase in B that it will be a rather large increase (> 100% increase relative to the control).
- Finally, the A cohort will generally have a low success rate (< 10%), so we cannot assume that the sampling distribution of the sample proportion is approximately normal.
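For concreteness, a single one-sided comparison of this shape is one call to Fisher's exact test in, e.g., Python's scipy (the counts below are made up purely for illustration):

```python
from scipy.stats import fisher_exact

# Hypothetical counts: treatment 10/50 successes, control 3/50 successes.
# Rows are [successes, failures]; "greater" tests for MORE successes in
# the first (treatment) row, matching the one-sided alternative above.
table = [[10, 40],   # treatment
         [3, 47]]    # control
odds_ratio, p = fisher_exact(table, alternative="greater")
print(odds_ratio > 1, p < 0.05)
```

With these made-up counts the one-sided P-value comes out below 0.05, so the (hypothetical) treatment effect would be declared significant at the 5% level.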
You can see some example data in my previous question.
First, I took some time to verify that the z-test does not work well when the success probability in the control group is as small as 10%.
Second, here are some results using a one-sided Fisher's exact test, which rejects the null hypothesis that the success probabilities in the two groups are equal when there are significantly more successes in the treatment group than in the control group. (This means that you would disregard as a fluke any result with significantly more successes in the control group.)
All of the results below are for Fisher's exact test, and sample sizes are equal in the two groups. I looked at cases for $n = n_T = n_C = 50, 100,$ and $200.$
$n = 50.$ Suppose the success probability in the control group is $\pi_C = 0.02$: If $\pi_T = 0.15,$ then the P-value averages $.07.$ If $\pi_T = 0.2,$ the average P-value decreases to $.022.$ And if $\pi_T = 0.25,$ the average P-value decreases to $.007.$ This is summarized in the first cluster below, and the second cluster is for $\pi_C = 0.1.$
I hope you can see that this gives you a rough idea of what differences between $\pi_C$ and $\pi_T$ can be reliably detected, and at what level of significance, for each of the three sample sizes. All average P-value results are based on simulation and are subject to small simulation errors.
Examples with $n = 100$ and control group with population proportion of successes $\pi_C = .10$: At the 5% significance level, you will seldom be able to detect that $\pi_T = .20$ is an improvement, usually be able to detect that $\pi_T = .25$ is an improvement, and seldom overlook that $\pi_T = .30$ is an improvement.
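If you want to check a scenario like this one without simulation error, the power of the one-sided Fisher test can be computed exactly by enumerating every possible pair of success counts. Here is a Python sketch of that idea (this is my own illustration using scipy, not the R program mentioned below):

```python
import numpy as np
from scipy.stats import binom, fisher_exact

n = 100          # per-group sample size
pi_C = 0.10      # control success probability
alpha = 0.05     # significance level

def exact_power(pi_T):
    # Exact power: sum the joint binomial probability of every
    # (x_T, x_C) outcome for which the one-sided test rejects.
    pmf_T = binom.pmf(np.arange(n + 1), n, pi_T)
    pmf_C = binom.pmf(np.arange(n + 1), n, pi_C)
    power = 0.0
    for x_T in range(n + 1):
        for x_C in range(n + 1):
            joint = pmf_T[x_T] * pmf_C[x_C]
            if joint < 1e-12:          # skip negligible outcomes for speed
                continue
            _, p = fisher_exact([[x_T, n - x_T], [x_C, n - x_C]],
                                alternative="greater")
            if p < alpha:
                power += joint
    return power

pw = exact_power(0.25)
print(round(pw, 3))   # fairly high, consistent with "usually detect"
```

Raising `pi_T` (or `n`) increases the computed power, which is a deterministic cross-check on the simulated average P-values.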
If you like, I can show you the R code I used to get these results; then you could investigate other scenarios. R is available free at www.r-project.org, and no particular knowledge of R would be necessary to change the numbers in my program and run additional scenarios. Finally, I would not trust even Fisher's exact test (at any sample size) unless the number of successes in the treatment group is at least 5.
Addendum: R code for Fisher exact tests. As requested, here is the R code used to obtain the information tabled above. Answers for one of the specific tabled situations are shown. Constants in the first two lines of code may be changed to investigate other situations. (Values for power, included here, are not tabled above.)
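For readers working in Python rather than R, here is a rough analogue of the simulation just described (a sketch of the same approach, not the original program); `scipy.stats.fisher_exact` plays the role of R's `fisher.test`, and the constants in the first lines can be changed to investigate other scenarios:

```python
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(1)
n = 50                   # per-group sample size
pi_C, pi_T = 0.10, 0.35  # success probabilities (Scenario (a) below)
B = 2000                 # number of simulated experiments

pvals = np.empty(B)
for i in range(B):
    x_T = rng.binomial(n, pi_T)          # successes in treatment group
    x_C = rng.binomial(n, pi_C)          # successes in control group
    _, pvals[i] = fisher_exact([[x_T, n - x_T], [x_C, n - x_C]],
                               alternative="greater")

print(round(pvals.mean(), 3))            # average P-value
print(round((pvals < 0.05).mean(), 3))   # estimated power at the 5% level
```

The average of the simulated P-values corresponds to the tabled values, and the fraction below $0.05$ estimates the power.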
Plots of simulated P-values are shown in the histograms below. Scenario (a) is for $n_C = n_T = 50;\, \pi_C = .1, \pi_T = .35,$ and in Scenario (b) $\pi_T = .25.$ The vertical dotted red lines are at $0.05,$ so the bar to the left of the line represents the power of the test: the probability of rejecting $H_0: \pi_T = \pi_C$ against the alternative $H_a: \pi_T > \pi_C$ (as specified), at level $\alpha = 5\%.$
Perhaps the first use of this code should be to verify the values in the table above to make sure there are no misprints.