Regression with multiple categories

  1. If you have a table of data $Y = f(X_0, X_1,\ldots, X_n)$, you can perform linear regression to find coefficients such that $$Y \approx a_0X_0 + a_1 X_1 + a_2 X_2 + \ldots + a_nX_n,$$ provided the variables $X_i$ are all numerical.

  2. If one of the variables, say $X_0$, is categorical, taking values from a discrete set of labels $x_1,\ldots, x_k$ instead of numerical values, one way to still perform linear regression is to replace $X_0$ with $k$ binary variables $B_1, \ldots, B_k$. These indicator variables are mutually exclusive and exhaustive, with $B_i=1$ if and only if $X_0=x_i$, and $B_i=0$ otherwise.

    The linear regression then looks like: $$Y \approx (b_1 B_1 + b_2 B_2 + \ldots + b_k B_k) + a_1X_1 + a_2 X_2 + \ldots + a_nX_n$$

  3. Alternatively: I have such a dataset in which one of the variables, $X_0$, is apparently categorical, taking values in a set $x_1,\ldots, x_k$. But I believe $X_0$ can be consistently reinterpreted as a numerical variable $U$. That is, I believe there are numbers $u_1, \ldots, u_k$ such that $f(X_0=x_i, X_1, \ldots, X_n)$ and $f(U=u_i, X_1, \ldots, X_n)$ agree on all values in the dataset, and such that $f(U, X_1,\ldots, X_n)$ is a linear function of all its arguments.

  4. Questions:

    • Assuming $f$ is exactly linear in its numerical arguments, how can I determine appropriate numerical analogues $u_1, \ldots, u_k$ for the categorical labels $x_1,\ldots, x_k$ of the categorical variable $X_0$? (I assume this is just a matter of linear algebra, but I'm not sure about the details).

    • Assuming $f$ is almost exactly linear in its numerical arguments, is there a regression procedure I can perform to find $u_1, \ldots, u_k$ which fit as well as possible? (And is this regression question well-defined, or might there be many solutions? Here I'm considering cases where there is a nonzero dependence on $U$ and considering solutions to be equivalent up to uniform scaling of the values $u_1,\ldots,u_k$.)
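To make the setup concrete, here is a minimal sketch of the indicator-variable regression from point 2, run on synthetic data that is exactly linear in a hidden numerical variable $U$ as in point 3. Everything in it is an assumption made for illustration: the labels `low`/`mid`/`high`, the hidden values in `u_true`, the coefficients `a0_true` and `a_true`, and the helper name `fit_indicator_regression`. It is not offered as the definitive answer, only as one way to read the question. If $Y = a_0 U + a_1 X_1 + \ldots + a_n X_n$, then comparing with the indicator model suggests $b_i = a_0 u_i$, so the fitted $b_i$ would give the $u_i$ up to the common factor $a_0$.

```python
import numpy as np

def fit_indicator_regression(labels, X, y):
    """Least-squares fit of y ≈ sum_i b_i B_i + X @ a (no intercept term)."""
    # np.unique returns the sorted distinct labels and, for each row,
    # the index of its label in that sorted list.
    levels, codes = np.unique(labels, return_inverse=True)
    B = np.eye(len(levels))[codes]          # one indicator column per label
    design = np.hstack([B, X])              # columns: B_1..B_k, X_1..X_n
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return levels, coef[:len(levels)], coef[len(levels):]

# Synthetic data (all values below are made up for the illustration).
rng = np.random.default_rng(0)
u_true = {"low": 1.0, "mid": 2.5, "high": 4.0}   # hidden numerical values u_i
a0_true = 2.0                                     # coefficient of U
a_true = np.array([0.5, -1.0])                    # coefficients of X_1, X_2

labels = rng.choice(list(u_true), size=200)
X = rng.normal(size=(200, 2))
U = np.array([u_true[label] for label in labels])
y = a0_true * U + X @ a_true                      # exactly linear, no noise

levels, b, a = fit_indicator_regression(labels, X, y)

# Each b_i should equal a0_true * u_i, so ratios of the b_i recover
# ratios of the u_i; the overall scale is not identifiable from the fit.
u_up_to_scale = b / b[0]
for level, b_i, u_i in zip(levels, b, u_up_to_scale):
    print(level, b_i, u_i)
```

Under these assumptions, the model with free values $u_i$ and a nonzero $a_0$ describes the same family of fits as the model with free $b_i$, which is why the sketch simply reuses ordinary least squares; with noisy data the same call would return best-fit $b_i$ rather than exact ones. Note that the sketch omits an intercept, matching the equations in the question; adding one would make the $k$ indicator columns collinear with it.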