Pros and cons of including controls in a regression?

Question

591 Views Asked by Bumbble Comm At 28 Mar 2026 - 11:37

Assume we have conducted a randomized experiment to assess the benefits of a drug. Let $Y_i$ be the outcome of interest, $X_i$ be some control variables (e.g. age, sex, etc.) and $$D_i= \left\{\begin{array}{cl} 1 & \textrm{if individual $i$ received the drug} \\ 0 & \textrm{otherwise.} \end{array} \right.$$ What are the pros and cons of the following specifications for estimating the effect of the drug, $\beta_1$?

(1) $Y_i=\alpha +\beta_1 D_i +\varepsilon_i$

(2) $Y_i=\alpha +\beta_1 D_i +\beta_2X_i + \varepsilon_i$

(3) $\Delta Y_i=\alpha +\beta_1 \Delta D_i +\varepsilon_i$

Edit: First, because of randomization, the treatment should be independent of both the error term and the control variables, so $\beta_1$ is estimated consistently whether or not the controls are included.

If the controls help explain $Y_i$, then including them reduces the residual variance and therefore the variance of the estimator of $\beta_1$. If the treatment and the controls happen to be correlated in the sample, then including the controls increases the variance of the estimator of $\beta_1$, because less independent variation in $D_i$ is left over. We should include controls when the sample is relatively large and the controls are unbalanced across treatment groups.
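The variance argument can be checked with a small simulation. The following is only a sketch under assumed values (a single normally distributed control, `beta1 = 1`, `beta2 = 2`, unit error variance, 200 subjects per experiment); none of these numbers come from the question. It repeats the randomized experiment many times and compares the spread of the $\beta_1$ estimates from specifications (1) and (2).

```python
import numpy as np

def ols_coef(X, y):
    """Least-squares coefficients for design matrix X (first column = intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def simulate(n=200, reps=2000, beta1=1.0, beta2=2.0, seed=0):
    """Repeat a randomized experiment and collect beta1 estimates.

    All parameter values are illustrative assumptions, not taken from the question.
    """
    rng = np.random.default_rng(seed)
    est_short = np.empty(reps)  # specification (1): no control
    est_long = np.empty(reps)   # specification (2): control included
    ones = np.ones(n)
    for r in range(reps):
        x = rng.normal(size=n)                         # control variable
        d = rng.integers(0, 2, size=n).astype(float)   # randomized treatment
        y = 0.5 + beta1 * d + beta2 * x + rng.normal(size=n)
        est_short[r] = ols_coef(np.column_stack([ones, d]), y)[1]
        est_long[r] = ols_coef(np.column_stack([ones, d, x]), y)[1]
    return est_short, est_long

est_short, est_long = simulate()
print("model (1): mean", est_short.mean(), "sd", est_short.std())
print("model (2): mean", est_long.mean(), "sd", est_long.std())
```

Under these assumptions both sets of estimates centre on the true effect, and the estimates from specification (2) are noticeably less dispersed, because the control absorbs part of the variation in $Y_i$.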

Finally, we may think that the effect of the treatment differs depending on the controls, in which case we may want to include an interaction term between the treatment and the controls.
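To illustrate the interaction idea, here is a hedged sketch with made-up data: a binary control present in roughly 30% of the sample, and a treatment effect of $+1$ when the control is 0 and $-1$ when it is 1. The helper `ols` and all numerical values are assumptions chosen for illustration.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(1)
n = 10_000
x = (rng.random(n) < 0.3).astype(float)        # hypothetical binary control, ~30% ones
d = rng.integers(0, 2, size=n).astype(float)   # randomized treatment
effect = np.where(x == 1, -1.0, 1.0)           # assumed heterogeneous effect
y = 2.0 + effect * d + 0.5 * x + rng.normal(size=n)

ones = np.ones(n)
b1 = ols(np.column_stack([ones, d]), y)             # specification (1)
b2 = ols(np.column_stack([ones, d, x]), y)          # specification (2), additive
b3 = ols(np.column_stack([ones, d, x, d * x]), y)   # (2) plus D*X interaction

effect_x0 = b3[1]           # estimated effect when x = 0
effect_x1 = b3[1] + b3[3]   # estimated effect when x = 1

print("model (1) effect:", b1[1])
print("model (2) effect:", b2[1])
print("interaction model: x=0 ->", effect_x0, " x=1 ->", effect_x1)
```

In this setup specifications (1) and (2) both report roughly the sample-average effect, while the coefficients of the interaction model recover the two subgroup effects separately.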

Original Q&A

There are 1 best solutions below

**Bumbble Comm** · Accepted Answer

Model (2) has the merit that it takes into account any (linear) correlation of the result with the control variables, thus partly eliminating the confounding effect of nuisance variables. For example, if the only control variable is $X = 0$ for male and $X=1$ for female, and the experiment was performed on 7000 male subjects and only 3000 female subjects, and the drug had benefits for males but equal negative effects for females, then model (1) would say the drug was beneficial on average, while model (2), once a treatment-by-sex interaction term is added, would expose the fact that it is not beneficial for females.

When the sample is small, however, fitting several control variables $X_i$ uses up degrees of freedom and can leave the fit statistically weak (a weak $\chi^2$ value), so model (1) would be better. For example, in that same experiment, if the sample were 12 males and 4 females, model (2) might give no trustworthy information, while the sample of 16, with one fewer fitted parameter, might let us say something under model (1).

Model (3) has the same benefits and drawbacks as model (2) but is applicable only in the case where it is sensible to group the values of $X_i$ into a small number of discrete categories. For example, in the 10000-person sample described above, it might be right to use model (3) rather than model (2), since the control variable is discrete.

The most important thing is to decide which model will be used before examining the data (other than knowing how many samples are taken with each value of the control variables). If you look at the data before fixing the analysis technique, then statistical honesty forces you to weaken your stated results because of the "look-elsewhere effect": you have to ask how likely it is that some plausible analysis scheme would have given a result with this degree of confidence. Such questions are difficult to answer except by Monte Carlo techniques.
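A rough Monte Carlo sketch of that effect follows. It assumes a null world with no treatment effect, five hypothetical standard-normal controls, 50 subjects, and an analyst who tries six specifications (no control, or any single control) and reports whichever gives the largest $|t|$. The function names, the 1.96 cutoff, and all settings are illustrative assumptions.

```python
import numpy as np

def t_stat_treatment(X, y):
    """t-statistic of the coefficient on column 1 of X (homoskedastic OLS)."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    return beta[1] / np.sqrt(sigma2 * xtx_inv[1, 1])

def false_positive_rates(n=50, n_controls=5, reps=2000, crit=1.96, seed=0):
    """Compare a pre-specified analysis with picking the best of several."""
    rng = np.random.default_rng(seed)
    fixed_hits = 0
    best_hits = 0
    ones = np.ones(n)
    for _ in range(reps):
        d = rng.integers(0, 2, size=n).astype(float)   # randomized treatment
        controls = rng.normal(size=(n, n_controls))
        # Null world: the outcome depends on the controls but not on d.
        y = 0.5 * controls.sum(axis=1) + rng.normal(size=n)
        specs = [np.column_stack([ones, d])]
        specs += [np.column_stack([ones, d, controls[:, j]])
                  for j in range(n_controls)]
        t_vals = [abs(t_stat_treatment(X, y)) for X in specs]
        fixed_hits += t_vals[0] > crit     # specification chosen in advance
        best_hits += max(t_vals) > crit    # specification chosen after looking
    return fixed_hits / reps, best_hits / reps

fixed_rate, best_rate = false_positive_rates()
print("pre-specified model:", fixed_rate)
print("best of six models:", best_rate)
```

In this setup the pre-specified analysis rejects at close to the nominal rate, while choosing the most favourable specification after the fact rejects more often, which is the inflation the answer warns about.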
