Equation for estimation of sample size is a quadratic?


The equation for calculating the sample size for a prevalence study is $$n= \frac {Z^2 p (1-p)}{e^2}$$ where $Z$ is the $Z$ score for the desired confidence level, $e$ is the precision (margin of error) we want to achieve, and $p$ is the true prevalence.
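To make the shape of the formula concrete, here is a quick numerical sketch in Python. The choices $Z=1.96$ (95% confidence) and $e=0.05$ are illustrative, not from the question:

```python
# Sketch of the sample-size formula n = Z^2 p(1-p) / e^2.
# Z = 1.96 (95% confidence) and e = 0.05 are illustrative choices.
Z = 1.96
e = 0.05

def sample_size(p, Z=Z, e=e):
    """Required sample size for estimating a prevalence p with margin e."""
    return Z**2 * p * (1 - p) / e**2

# n is symmetric about p = 0.5 and maximal there:
for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p = {p:.1f}  ->  n = {sample_size(p):7.1f}")
```

Tabulating $n$ against $p$ like this shows the quadratic (inverted-parabola) shape at a glance: $n$ vanishes at the endpoints and peaks at $p=0.5$.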

So my doubt is: why is this equation quadratic in $p$? Say the prevalence is zero; then the sample size becomes zero. The same happens when the prevalence is 100%. That means the estimated sample size will be low when the prevalence is either very high or very low.

A low estimated sample size for a high prevalence makes sense, but what is the rationale behind a low sample size for a low expected prevalence?
In short, why is this equation quadratic? Shouldn't it be linear? What's the derivation?

BEST ANSWER

I am attempting an answer with a partial understanding of the mathematics involved (and the intuition underneath it), so please do let me know if there is any obvious error of interpretation.

INTUITIVE

The formula is maximised at $p=0.5$. The quadratic nature of $n$ is not something the derivation builds in for its own sake; it arises directly from plugging in the formula for the Bernoulli variance, $p(1-p)$, which already contains the quadratic term in $p$.

The formula tells us the minimum sample size we require so that the prevalence we measure in the sample carries a given statistical confidence, represented by $Z$. If we use a $Z$ value corresponding to a $95\%$ confidence interval, it means the measured prevalence in the sample has a $95\%$ chance of lying between $p-e$ and $p+e$.

The variance of a Bernoulli distribution (one with only two outcomes, like this) can be understood either purely mathematically, as the outcome of well-defined operations, or through the sampling distribution: if you were to repeatedly draw samples of the same fixed size $n$ from the population, measure $p$ in each, and plot the distribution of those measurements, the variance of that distribution would be directly proportional to the Bernoulli variance used in the formula (for fixed $n$ it equals $p(1-p)/n$).

So, for a fixed sample size and a fixed confidence level $Z$, the margin satisfies $e^2 \propto p(1-p)$: the greater the variance (maximal at $p=0.5$), the larger the $e$ needed to keep the same level of statistical confidence. Conversely, to keep the same $e$ for a $p$ with greater variance, we need a larger sample size, which is therefore also maximal at $p=0.5$.

Why is this variance maximal at $0.5$? First, note that for $p_1$ and $p_2$ such that $p_1=1-p_2$, the variance is the same, because the two situations are mathematically equivalent; which probability we consider more relevant is just a convention. It is immaterial to the variance whether we count the individuals with the disease or without it.

Variance involves the squared distances of the individual observations ($1$ and $0$) from the mean. At $p=0.5$ the mean is $0.5$, midway between the poles ($1$ and $0$). Now suppose we change $p$ slightly. What we have essentially done is bring the mean closer to one of the poles, namely the one whose probability was increased and which will now appear in the data more often than the other. That pole's squared distance from the mean falls, and, since the pole also increased in frequency, this reduced distance is now weighted more heavily in the variance calculation. The other pole moves farther away, so its squared distance rises, but this is outweighed by the change in frequencies. The net result of moving the mean away from the middle is a loss of variance. (This is exactly analogous to the moment of inertia in physics: the outcomes here, the "poles," are weights placed at the ends of a rod, and we calculate the moment of inertia, which involves a squared term, about the balance point between them. It is maximised when the weights are equal and that point is equidistant from both.) This fall in variance can be shown mathematically too.
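The sampling-distribution picture above can be checked by simulation. The sketch below (the sample size $n=200$ and the number of repetitions are arbitrary illustrative choices) repeatedly draws samples, records the observed prevalence $\hat p$ in each, and compares the empirical variance of $\hat p$ with the theoretical $p(1-p)/n$:

```python
# Simulation sketch: the spread of the sample prevalence p-hat across
# repeated samples of fixed size n tracks p(1-p)/n and is widest at p = 0.5.
import random

random.seed(0)
n, reps = 200, 2000  # illustrative choices

def phat_variance(p):
    """Empirical variance of the sample prevalence over repeated samples."""
    phats = [sum(random.random() < p for _ in range(n)) / n for _ in range(reps)]
    m = sum(phats) / reps
    return sum((x - m) ** 2 for x in phats) / reps

for p in (0.1, 0.3, 0.5):
    print(f"p = {p:.1f}: empirical Var = {phat_variance(p):.5f}, "
          f"theoretical p(1-p)/n = {p * (1 - p) / n:.5f}")
```

The empirical and theoretical columns agree closely, and both peak at $p=0.5$, matching the intuition in the text.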

MATHEMATICAL

The $p$ here can also be viewed as the probability that a randomly picked individual from the population has the disease. Hence, randomly picking an individual is like tossing a biased coin, where "heads" represents having the disease and the probability of "heads" is $p$. Let the random variable $X$ represent the outcome of this biased coin, with $1$ for "heads" and $0$ for "tails". According to the Bernoulli probability distribution (comment if an explanation of this is needed) $$f(x)=p^{x}(1-p)^{1-x},\qquad x\in\{0,1\},$$ the variance can be worked out as follows: $$\operatorname{Var}(X)=\operatorname{E}[(X-\mu)^2]=\operatorname{E}[X^2]-\operatorname{E}[X]^2 \quad\text{(arithmetic rearrangement)}.$$ Plugging in $\operatorname{E}[X]=p$ and $\operatorname{E}[X^2]=1^2\cdot p+0^2\cdot(1-p)=p$, it turns out to be $p-p^2=p(1-p)$.
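A direct numerical transcription of that derivation (written out term by term rather than using a statistics library) confirms the result for several values of $p$:

```python
# Check that E[X^2] - E[X]^2 = p(1-p) for a Bernoulli variable,
# where X takes value 1 with probability p and 0 with probability 1-p.
def bernoulli_variance(p):
    e_x = 1 * p + 0 * (1 - p)         # E[X]   = p
    e_x2 = 1**2 * p + 0**2 * (1 - p)  # E[X^2] = p (squaring fixes 0 and 1)
    return e_x2 - e_x**2              # p - p^2 = p(1-p)

for p in (0.0, 0.2, 0.5, 0.8, 1.0):
    assert abs(bernoulli_variance(p) - p * (1 - p)) < 1e-12
```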

Writing $p=0.5 + x$, where $x$ is the distance by which we move from the centre (i.e., by which the expected value moves from $0.5$ due to a change in probability), we can show that the variance is $$p(1-p)=(0.5+x)(0.5-x)=0.25-x^2,$$ where for $0\le x\le 0.5$ we have $x^2\ge 0$, and hence the variance declines for every $x\neq 0$.
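A one-line check of this completed-square identity, with the offset $x$ as the input:

```python
# With p = 0.5 + x, the variance p(1-p) = (0.5 + x)(0.5 - x) = 0.25 - x^2,
# so it falls as |x| grows and is maximal at x = 0 (i.e. p = 0.5).
def variance_from_offset(x):
    p = 0.5 + x
    return p * (1 - p)

for x in (0.0, 0.1, 0.25, 0.5):
    assert abs(variance_from_offset(x) - (0.25 - x**2)) < 1e-12
```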

(Figure: three sampling distributions $A$, $B$, $C$.) The graphs represent various sampling distributions, i.e., what you would see if you sampled repeatedly with different $p$ and $n$ and then plotted the observed prevalence. $A$ is the distribution with $n_1, p_1$; $B$ with $n_2, p_2$; and $C$ with $n_3, p_2$, such that $p_1=0.5$, $p_2<p_1$, $n_1=n_2>n_3$. As is apparent, at a higher variance, either a larger $e$ is required to enclose the same area within the bounds (representative of statistical confidence), or a larger $n$, which "streamlines" the graph.