Avoiding collinearity in logistic regression


I have the following problem: I'm performing a multivariate logistic regression on several variables, each of which is measured on a nominal scale. I want to avoid multicollinearity in my regression. If the variables were continuous, I could compute the variance inflation factor (VIF) and look for variables with a high VIF. If the variables were ordinally scaled, I could compute Spearman's rank correlation coefficient for several pairs of variables and compare the computed value with a certain threshold. But what do I do if the variables are just nominally scaled? One idea would be to perform a pairwise chi-square test for independence, but the different variables don't all have the same codomains, so that would be another problem. Is there a way to solve this?
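(For what it's worth, the pairwise chi-square idea can be sketched directly: the Pearson chi-square statistic is defined for any $r \times c$ contingency table, so the two variables having different codomains is not by itself an obstacle. A minimal sketch with hypothetical toy labels, using only the standard library:)

```python
from collections import Counter

# Toy nominal data (hypothetical labels); x1 has 3 levels, x2 has 2.
x1 = ["a", "a", "b", "b", "c", "c", "a", "b", "c", "a"]
x2 = ["yes", "no", "yes", "yes", "no", "no", "yes", "no", "no", "yes"]

levels1 = sorted(set(x1))
levels2 = sorted(set(x2))
n = len(x1)

# Observed cell counts and marginal counts.
obs = Counter(zip(x1, x2))
row = Counter(x1)
col = Counter(x2)

# Pearson chi-square: sum of (O - E)^2 / E over all cells,
# with E = row_total * col_total / n. The table may be r x c
# with r != c, so different codomains are fine.
chi2 = 0.0
for c1 in levels1:
    for c2 in levels2:
        expected = row[c1] * col[c2] / n
        chi2 += (obs[(c1, c2)] - expected) ** 2 / expected

dof = (len(levels1) - 1) * (len(levels2) - 1)
print(f"chi2 = {chi2:.3f}, dof = {dof}")
```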

Thanks in advance!


Strictly speaking, multicollinearity is a concept unique to real-valued covariates. It is also a specialized indicator of dependence between covariates. You have the right idea in that it's important to avoid covariates that are too "mutually dependent", even when they have categorical domains. In fact, you can use the idea of multicollinearity to motivate a statistic that measures such dependence among categorical covariates. I'll try to do this through an example.

Suppose you have two covariates, $X_1$ and $X_2$, which are both continuous. The VIF is equal to $1/(1-R^2)$, where $R^2$ is the squared correlation between $X_1$ and $X_2$. Thus, multicollinearity simply depends on how well $X_1$ predicts $X_2$, as assessed by $R^2$.
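(A minimal sketch of this formula with hypothetical toy data, where $X_2$ is deliberately constructed to be nearly a linear function of $X_1$:)

```python
# Toy continuous covariates; x2 is nearly a linear function of x1
# (hypothetical data, just to illustrate the VIF formula).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1]
n = len(x1)

mean1 = sum(x1) / n
mean2 = sum(x2) / n

# Squared Pearson correlation: R^2 = cov^2 / (var1 * var2).
cov = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2))
var1 = sum((a - mean1) ** 2 for a in x1)
var2 = sum((b - mean2) ** 2 for b in x2)
r_squared = cov ** 2 / (var1 * var2)

# VIF = 1 / (1 - R^2): near-collinearity (R^2 close to 1) inflates it.
vif = 1.0 / (1.0 - r_squared)
print(f"R^2 = {r_squared:.4f}, VIF = {vif:.1f}")
```

Here the near-collinear pair yields a VIF far above the common rule-of-thumb cutoffs (5 or 10), which is the signal you would look for.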

Now, suppose $X_1$ and $X_2$ are categorical. You can still use $X_1$ to predict $X_2$ (e.g. through a logistic regression), but you wouldn't assess the quality of the prediction using the usual $R^2$ in this case. There are many different pseudo-$R^2$ formulas you could consider; I think one useful approach is to base one on entropy reduction. For instance, the marginal entropy $H(X_2)$ of $X_2$ can easily be estimated, and the conditional entropy $H(X_2|X_1)$ can be estimated nearly as easily. You can then define a pseudo-$R^2$ with the formula $$R^2_{pseudo} = 1 - \dfrac {\hat{H}(X_2|X_1)} {\hat{H}(X_2)}$$ Since conditioning on $X_1$ can only reduce the entropy of $X_2$, this ratio lies between 0 and 1. Large values of $R^2_{pseudo}$ would then imply increased dependence among the covariates, and small values would indicate otherwise.
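(A minimal sketch of this entropy-based pseudo-$R^2$, using plug-in (empirical-frequency) entropy estimates, normalized by the marginal entropy of $X_2$. The toy nominal data below are hypothetical:)

```python
import math
from collections import Counter

# Toy nominal covariates (hypothetical labels).
x1 = ["a", "a", "a", "b", "b", "b", "c", "c", "c", "c"]
x2 = ["u", "u", "v", "v", "v", "u", "w", "w", "w", "u"]
n = len(x1)

def entropy(counts, total):
    """Plug-in Shannon entropy (in nats) from a dict of counts."""
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)

# Marginal entropy H(X2).
h_x2 = entropy(Counter(x2), n)

# Conditional entropy H(X2 | X1) = sum over levels x of X1 of
# P(X1 = x) * H(X2 | X1 = x).
h_x2_given_x1 = 0.0
for level, count in Counter(x1).items():
    sub = Counter(b for a, b in zip(x1, x2) if a == level)
    h_x2_given_x1 += (count / n) * entropy(sub, count)

# Pseudo-R^2 = 1 - H(X2|X1) / H(X2); values near 1 signal that X1
# strongly predicts X2, i.e. the covariates are highly dependent.
r2_pseudo = 1.0 - h_x2_given_x1 / h_x2
print(f"H(X2) = {h_x2:.3f}, H(X2|X1) = {h_x2_given_x1:.3f}, "
      f"pseudo-R^2 = {r2_pseudo:.3f}")
```

This quantity is also known as the uncertainty coefficient (Theil's U), and, unlike a plain chi-square statistic, it is a normalized measure in $[0, 1]$ regardless of how many levels each variable has.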