Why is lack of representation a sampling bias?


My approach to understanding sampling bias stems from the definition: Sampling bias is when some members of the population are not as likely to be chosen as others.

Suppose we conduct a survey in which we select 50 employees from a company completely at random. I've been taught that, to avoid sampling bias in such a study, you need to watch for lack of representation even if the sampling seems random. For example, the company might have a handicapped individual whom we didn't pick. Because they are not represented, there is bias.

My initial thoughts on this were: That individual had the same probability to be picked as everyone else, so there's no sampling bias. On the contrary, if we make sure to have the minorities represented, we would necessarily be introducing bias, as each member of the majority would now have a lower probability to be selected.

Where did I go wrong, and how should I be thinking about this?


There are 2 best solutions below

0
On

You are right: to avoid sampling bias, each individual in the population must be equally likely to be chosen for the sample.

What you describe with your example of the handicapped individual not being chosen may be related to what is known as stratified sampling, where, before taking the sample, you partition the population into subpopulations (or strata) and sample from each one in order to reduce sampling error.
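As a rough illustration of the idea, here is a minimal Python sketch of stratified sampling. The company data, the `stratified_sample` helper, and the per-stratum sample sizes are all made up for the example; it only assumes that a simple random sample is drawn independently within each stratum.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, sizes, rng):
    """Draw a simple random sample of sizes[h] units from each stratum h."""
    # Partition the population into strata.
    strata = defaultdict(list)
    for unit in population:
        strata[stratum_of(unit)].append(unit)

    # Sample without replacement inside each stratum, independently.
    sample = []
    for name, units in strata.items():
        sample.extend(rng.sample(units, sizes[name]))
    return sample


# Hypothetical company: 100 employees, 5 of whom have a disability.
employees = [{"id": i, "disabled": i < 5} for i in range(100)]

# Hypothetical design: 47 from the larger stratum, 3 from the smaller one.
rng = random.Random(0)
sample = stratified_sample(
    employees,
    stratum_of=lambda e: e["disabled"],
    sizes={False: 47, True: 3},
    rng=rng,
)

print(len(sample), sum(e["disabled"] for e in sample))
```

Every stratum is now guaranteed to appear in the sample. Note that in this design the selection probabilities differ between strata (3/5 versus 47/95), so any estimate computed from the sample has to take that into account.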

0
On

In statistics, bias is a term used specifically to refer to the difference between the expected value of an estimator and the property it is trying to estimate. If all members of the population have a non-zero probability of being included in the sample, then for any linear quantity in the population (e.g. count of people with a disability, average number of hours worked by an employee in a week) you can always construct a linear, unbiased estimator for that property by weighting each sample unit's contribution in inverse proportion to its probability of inclusion.

In other words, if $\pi_i$ is the probability that employee $i$ is in the sample, and $y_i$ is the value of some property of employee $i$, then the estimator

$$\hat{Y}_\pi = \sum_{i \in S} \pi_i^{-1} y_i$$

is unbiased for $Y = \sum_{i \in U} y_i$, the sum of all values of $y_i$ in the whole organisation. For example, if a sample of $n$ out of $N$ employees is drawn with every employee given an equal chance of inclusion, then $\pi_i = \frac{n}{N}$ is just the fraction of people chosen in the sample, and $\hat{Y}_\pi = \frac{N}{n} \sum_{i \in S} y_i$ is just the sample average scaled up by the total number of people. On the other hand, if you employ unequal probabilities of selection - for example, if you use a stratified sample with different probabilities in each stratum to ensure you get a good spread of employees across different groups - then you would adjust the weighting appropriately.
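To make the unbiasedness claim concrete, the sketch below enumerates every possible sample of a small stratified design and averages the estimates exactly, using fractions rather than simulation. The population values, stratum names, and sample sizes are invented for illustration; the design assumed is simple random sampling without replacement within each stratum, so that $\pi_i$ equals the stratum sample size divided by the stratum size.

```python
from fractions import Fraction
from itertools import combinations, product

# Hypothetical toy population: weekly hours worked, in two strata.
strata = {
    "majority": [40, 38, 42, 36, 44, 40],
    "minority": [20, 30],
}
# Hypothetical design: units drawn per stratum.
sample_sizes = {"majority": 2, "minority": 1}

N = sum(len(values) for values in strata.values())
n = sum(sample_sizes.values())
true_total = sum(sum(values) for values in strata.values())


def inclusion_probability(name):
    # pi_i = n_h / N_h for every unit in stratum h.
    return Fraction(sample_sizes[name], len(strata[name]))


def weighted_estimate(sample):
    # Inverse-probability weighting: sum of y_i / pi_i over the sample.
    return sum(
        Fraction(y) / inclusion_probability(name)
        for name, ys in sample.items()
        for y in ys
    )


def naive_estimate(sample):
    # Ignores the design: N times the unweighted sample mean.
    return Fraction(N * sum(sum(ys) for ys in sample.values()), n)


def all_samples():
    # Every possible stratified sample; all are equally likely here.
    names = list(strata)
    choices = [list(combinations(strata[h], sample_sizes[h])) for h in names]
    for combo in product(*choices):
        yield dict(zip(names, combo))


samples = list(all_samples())
expected_weighted = sum(weighted_estimate(s) for s in samples) / len(samples)
expected_naive = sum(naive_estimate(s) for s in samples) / len(samples)

print(true_total, expected_weighted, expected_naive)
```

With these numbers the true total is 290. The weighted estimator averages to exactly 290 over all 30 possible samples, while the unweighted one averages to 280, because it ignores the fact that the smaller stratum is sampled at a higher rate.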

There are a whole bunch of subsidiary concerns - for example, whether you want to calculate estimates within the subgroups, and how much variability the various estimates may have - which form the basis of sampling theory, an entire discipline in itself. There are also a variety of alternative estimators you can use that accept a little bit of bias in exchange for a hopefully large reduction in overall error; these are the basis of the related discipline of estimation theory.