Linear regression model with 2 categorical variables

Question


211 Views Asked by Bumbble Comm At 11 May 2026 - 4:13

Consider the following problem: we want to predict a variable $y$ from two categorical variables, $A$, which can take 3 different values, and $B$, which can take 2 different values.

A regression model with interaction would be
$$y = \sum_{k=1}^{3}\alpha_k\mathbb{1}_{A_k}+\sum_{k=1}^{2}\beta_k\mathbb{1}_{B_k}+\sum_{i,j}\gamma_{ij}\mathbb{1}_{A_iB_j}$$
An equivalent formulation is
$$y = \mu +\sum_{k=1}^{3}\hat{\alpha}_k\mathbb{1}_{A_k}+\sum_{k=1}^{2}\hat{\beta}_k\mathbb{1}_{B_k}+\sum_{i,j}\hat{\gamma}_{ij}\mathbb{1}_{A_iB_j}$$
In the second formulation, does $\mu$ have a "meaning"? Is it the mean of $y$ across all categories? Is there any advantage to using one formula over the other in this setting?

Also, if we consider R's output, which drops the first category of each variable, how should we interpret its intercept coefficient?


**Bumbble Comm** · Accepted Answer

The issue with your formulations is that different sets of coefficients give the same predictions, so the coefficients are not uniquely determined. For example, in the first expression you could add a constant $c$ to all the $\alpha_k$ and subtract the same $c$ from all the $\beta_k$ and get the same $y$; in the second, you could add $c$ to $\mu$ and subtract the same $c$ from all the $\hat{\beta}_k$.
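The redundancy described above can be checked numerically. The following is a minimal sketch, not part of the original answer: the layout `d` and the level names `A1`, `A2`, `A3`, `B1`, `B2` are made up for illustration, and the full indicator matrices are built by hand with `model.matrix`.

```r
# Hypothetical balanced layout: 3 levels of a, 2 levels of b, 2 rows per cell.
d <- expand.grid(a = c("A1", "A2", "A3"), b = c("B1", "B2"))
d <- d[rep(seq_len(nrow(d)), times = 2), ]

# Full indicator columns, with no level dropped.
A  <- model.matrix(~ a - 1, data = d)                  # 3 columns
B  <- model.matrix(~ b - 1, data = d)                  # 2 columns
AB <- model.matrix(~ interaction(a, b) - 1, data = d)  # 6 columns

X_first  <- cbind(A, B, AB)         # first formulation: 11 coefficients
X_second <- cbind(mu = 1, X_first)  # second formulation: 12 coefficients

# Both matrices have rank 6 (the number of cells), far fewer than their
# number of columns, so the coefficients cannot be unique.
rank_first  <- qr(X_first)$rank
rank_second <- qr(X_second)$rank
cat("columns:", ncol(X_first), ncol(X_second),
    " rank:", rank_first, rank_second, "\n")

# Arbitrary coefficients, then +c on every alpha and -c on every beta.
theta <- seq_len(ncol(X_first)) / 10
shift <- c(rep(1.5, 3), rep(-1.5, 2), rep(0, 6))
same_fit <- all.equal(drop(X_first %*% theta),
                      drop(X_first %*% (theta + shift)))
print(same_fit)
```

Each row has exactly one $A$ indicator and one $B$ indicator equal to 1, which is why the two shifts cancel in every prediction.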

To get a unique solution, R exploits this freedom: with its default treatment contrasts it sets $\hat{\alpha}_1$, $\hat{\beta}_1$, all the $\hat{\gamma}_{1j}$ and all the $\hat{\gamma}_{i1}$ to zero.
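To see which columns survive under this convention, one can inspect the design matrix R builds. This is a small sketch assuming R's default `contr.treatment` contrasts are in effect; the layout `d` and its level names are hypothetical.

```r
# Hypothetical layout with one row per (a, b) combination.
d <- expand.grid(a = c("A1", "A2", "A3"), b = c("B1", "B2"))

# Design matrix for y ~ a * b under the default treatment contrasts.
X <- model.matrix(~ a * b, data = d)

# Only 6 columns remain: nothing involving A1 or B1 appears.
print(colnames(X))

# The matrix now has full column rank, so the coefficients are unique.
full_rank <- qr(X)$rank == ncol(X)
print(full_rank)

# Coding used for a: the first level is the all-zero row (the baseline).
print(contrasts(d$a))
```

The six remaining columns match the six cells of the $3 \times 2$ layout, so the model can still reproduce any set of cell means.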

In this case, $\mu$ is the predicted value when each independent variable takes the first level of its factor (it is simply a prediction here because you have no numerical independent variables). As an illustration, consider this toy example:

mydf <- data.frame( a=c("F","F","F","G","G","G","H","H","H","H"),
                    b=c("S","S","T","S","T","T","S","S","T","T"),
                    y=c( 1 , 3 , 5 , 3 , 6,  8,  3 , 5 , 4,  6 ))
fit <- lm(y ~ a * b, data=mydf)

to give

> fit

Call:
lm(formula = y ~ a * b, data = mydf)

Coefficients:
(Intercept)           aG           aH           bT        aG:bT        aH:bT  
          2            1            2            3            1           -2

The intercept of $2$ corresponds to the prediction when $a$ is "F" and $b$ is "S". It is in fact the average of the first two $y$ values in the data frame, $1$ and $3$, as you might intuitively expect. Then, for example:

- adding the bT value of $3$ to the intercept gives $5$, the third value in the data frame and the prediction when $a$ is "F" and $b$ is "T";
- adding the aG value of $1$ to the intercept gives $3$, the fourth value in the data frame and the prediction when $a$ is "G" and $b$ is "S";
- adding the aG, bT and aG:bT values ($1$, $3$ and $1$) to the intercept gives $7$, the average of the fifth and sixth values in the data frame ($6$ and $8$) and the prediction when $a$ is "G" and $b$ is "T".
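The same check can be run for all six cells at once. The sketch below refits the toy example, compares the model's predictions with the observed cell means, and compares the intercept with the overall mean of $y$. The last step uses sum-to-zero contrasts (`contr.sum`), which the answer above does not use; it is included only as an assumption-labelled illustration of a coding in which the intercept is an average across categories.

```r
mydf <- data.frame(
  a = factor(c("F", "F", "F", "G", "G", "G", "H", "H", "H", "H")),
  b = factor(c("S", "S", "T", "S", "T", "T", "S", "S", "T", "T")),
  y = c(1, 3, 5, 3, 6, 8, 3, 5, 4, 6)
)
fit <- lm(y ~ a * b, data = mydf)

# Observed mean of y in each (a, b) cell.
cell_means <- tapply(mydf$y, list(a = mydf$a, b = mydf$b), mean)
print(cell_means)

# Predictions for every combination of levels (a varies fastest,
# matching the column-major order of cell_means).
grid <- expand.grid(a = levels(mydf$a), b = levels(mydf$b))
grid$pred <- unname(predict(fit, newdata = grid))
print(grid)

# Treatment contrasts: the intercept is the (F, S) cell mean,
# not the overall mean of y.
intercept  <- coef(fit)[["(Intercept)"]]
grand_mean <- mean(mydf$y)
cat("intercept:", intercept, " mean(y):", grand_mean, "\n")

# Assumption for illustration: sum-to-zero contrasts on both factors.
# The intercept then equals the unweighted average of the six cell means.
fit_sum <- lm(y ~ a * b, data = mydf,
              contrasts = list(a = "contr.sum", b = "contr.sum"))
intercept_sum <- coef(fit_sum)[["(Intercept)"]]
cat("contr.sum intercept:", intercept_sum,
    " mean of cell means:", mean(cell_means), "\n")
```

Because the cells have unequal counts here, the average of the cell means differs slightly from `mean(y)`; the two coincide only in a balanced design.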
