Sample size calculation - without any statistical information

I'm a computer science master's student researching certain properties of a benchmark of Boolean functions. However, because these properties are expensive to compute (for example, the power dissipation of a digital integrated circuit, which requires many simulations), I have to select a subset of candidates to run the experiments on.

So I'm having trouble defining the sample size to draw from the benchmark. To make this concrete: I'm working with a benchmark A (3982 functions, i.e. the population), which I divide into A1 and A2 (3183 and 799 functions, respectively) according to some characteristics (you can think of them as two different classes). The table below illustrates this:

Benchmark |  A1   |  A2
    A     | 3183  | 799

I have no information about the standard deviation, mean, or any other statistic of any parameter, since I haven't run any experiments yet. I only have these counts, and the first step is to define the set for the experiments.

Any idea how to proceed in this case? While researching, I found this formula for the sample size n:

n = NZ²p(1-p)/[(N-1)e²+Z²p(1-p)]

Where N is the population size, p is the expected proportion, e is the maximum margin of error, and Z is the standard normal critical value corresponding to the desired confidence level.

Setting p = 0.5, e = 10% and Z = 1.645 (90% confidence), I obtained the values in the following table:

        |  A1   |  A2
Total   | 3183  | 799
Sample  |  67   | 63
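
For reference, here is a minimal Python sketch of the calculation. It assumes the denominator term is Z²p(1-p), which is the form that reproduces the sample sizes in the table; the function name `sample_size_proportion` is only illustrative.

```python
import math

def sample_size_proportion(N, p=0.5, e=0.10, z=1.645):
    """Sample size for estimating a proportion in a finite population of size N.

    p: assumed proportion (0.5 is the most conservative choice)
    e: maximum margin of error
    z: standard normal critical value (1.645 for 90% confidence)
    """
    q = z**2 * p * (1 - p)
    n = N * q / ((N - 1) * e**2 + q)
    return math.ceil(n)  # round up so the margin of error is not exceeded

for name, N in [("A1", 3183), ("A2", 799)]:
    print(name, N, sample_size_proportion(N))
```

With the defaults above this prints 67 for A1 and 63 for A2.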

Is that OK? Any other ideas about how to calculate these sample sizes?

Thanks in advance!

Best answer

There are many formulas for determining the sample size required for various statistical analyses. From our discussion in Comments, I think it is worthwhile mentioning the formula for the sample size $n$ necessary for a particular type of confidence interval.

Let's assume your data are from a normal population and you want a confidence interval for the mean $\mu$ of that population, using the sample mean $\bar X$ as an estimate of $\mu.$

In order to determine the necessary sample size $n,$ you need to know (or have realistic estimates of) three things:

a. The population standard deviation $\sigma.$

b. The desired level of confidence (usually 90%, 95%, or 99%).

c. The tolerable margin of error $M.$ The confidence interval will be of the form $\bar X \pm M,$ so the total length of the CI will be $2M.$

Of course it is desirable to have a high level of confidence and a short interval. In practice, one usually has to try out a few choices under (b) and (c), and then be realistic about what is attainable, given the time and money at hand.

The formula is $$n \approx \left(\frac{z_{\alpha/2} \sigma}{M} \right)^2,$$ where the level of confidence is $100(1 - \alpha)$% and $z_{\alpha/2}$ cuts probability $\alpha/2$ from the upper tail of the standard normal distribution. This number can be obtained from a printed table of the standard normal CDF. For example, if you want a 95% CI, then $\alpha = .05,$ so $\alpha/2 = .025$ and $z_{.025} = 1.96.$ If you want a 90% CI, then $\alpha = .10$ and you use $z_{.05} = 1.645.$
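
As a rough illustration, the formula can be evaluated with Python's standard library, which also replaces the printed table. This is only a sketch: the function name `sample_size_for_mean` and the values $\sigma = 3.5$ and $M = 1$ are made up for the example.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, margin, confidence=0.95):
    """Approximate n so that a z-based CI for the mean has half-width `margin`."""
    alpha = 1 - confidence
    # z_{alpha/2}: cuts probability alpha/2 from the upper tail of N(0, 1)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z * sigma / margin) ** 2)

# Hypothetical inputs: sigma = 3.5 and a tolerable margin of error M = 1
print(sample_size_for_mean(3.5, 1.0, confidence=0.95))
print(sample_size_for_mean(3.5, 1.0, confidence=0.90))
```

With these assumed inputs the 95% interval needs about 48 observations and the 90% interval about 34, which shows the trade-off between confidence level and sample size mentioned above.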

In practice, one seldom knows the exact value of $\sigma.$ Then a sample standard deviation $S$ from a pilot experiment, or from a previous experiment with similar measurements from a similar population, can be used. Technically, then one should use a more complicated formula based on Student's t distribution, instead of the normal distribution. But if the value of $n$ turns out to be above 30 for a 95% CI (or above 20 for a 90% CI), then the formula above is OK.

More speculatively, one might appeal to the Empirical Rule. About 95% of the probability under a normal curve lies within $\mu \pm 2\sigma.$ If you guess $\mu$ for the heights of 20-year-old US males is about 69 inches, and that about 95% of them might be between 62 and 76 inches tall, then the range $76 - 62 = 14$ spans about $4\sigma,$ so you'd guess that $\sigma \approx 3.5,$ which is about right.

If you are in a totally new situation and nobody has any intuition at all what variability to expect, then just get started taking data and then estimate $\sigma \approx S,$ where $S$ is based on a couple of dozen initial observations. Then use the formula displayed above to see how many more observations you're likely to need in order to get a CI with the desired specifications.
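
A minimal sketch of this two-stage idea in Python is shown below. It assumes the normal-based formula above is adequate; the function name `additional_observations` and the pilot values are hypothetical, chosen only to show the mechanics.

```python
import math
from statistics import NormalDist, stdev

def additional_observations(pilot, margin, confidence=0.95):
    """Estimate sigma by the pilot sample SD S, then apply n = (z * S / M)^2.

    Returns (total n suggested, how many observations beyond the pilot).
    """
    s = stdev(pilot)  # sample standard deviation S (n - 1 in the denominator)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_total = math.ceil((z * s / margin) ** 2)
    return n_total, max(0, n_total - len(pilot))

# Made-up pilot data: two dozen initial measurements
pilot = [8.0] * 12 + [12.0] * 12
n_total, n_more = additional_observations(pilot, margin=0.5)
print(n_total, n_more)
```

Because $S$ is itself only an estimate, it may be worth recomputing the requirement once more data are in hand.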

If you don't have data that are normal, but are from some other distribution, then you can look at the formula for making CIs based on such data, and then try to deduce the required sample size from that. After all, the displayed formula is only a few steps of algebra away from the formula $\bar X \pm 1.96 \sigma/\sqrt{n},$ for a 95% CI of the mean $\mu$ of a normal population, based on the sample mean $\bar X$ and knowledge of $\sigma.$ Just set $M = 1.96 \sigma/\sqrt{n}$ and solve for $n.$
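
Spelled out for the 95% case, those few steps of algebra are:
$$M = 1.96\,\frac{\sigma}{\sqrt{n}} \;\Longrightarrow\; \sqrt{n} = \frac{1.96\,\sigma}{M} \;\Longrightarrow\; n = \left(\frac{1.96\,\sigma}{M}\right)^2.$$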

Note: However, bear in mind that not all CIs are of the form 'point est' $\pm$ 'marg of error', as for the normal mean $\mu.$ For example, a 95% CI for normal $\sigma,$ using the sample SD $S$ of $n$ observations, is based on the fact that $(n-1)S^2/\sigma^2$ has a chi-squared distribution with $n-1$ degrees of freedom.
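
To sketch what that interval looks like (the symbols $L$ and $U$ are introduced here only for illustration): let $L$ and $U$ cut probability $.025$ from the lower and upper tails, respectively, of the chi-squared distribution with $n-1$ degrees of freedom. Then
$$P\left(L < \frac{(n-1)S^2}{\sigma^2} < U\right) = .95 \;\Longrightarrow\; \sqrt{\frac{(n-1)S^2}{U}} < \sigma < \sqrt{\frac{(n-1)S^2}{L}},$$
which is not symmetric about $S,$ so a sample size calculation for it cannot simply reuse the margin-of-error formula.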