Can I run a regression when the independent and dependent variables are all dichotomous?


I have conducted a survey where all my questions are asked in a dichotomous manner (Yes/No).

E.g., IV: "Are you a smoker?", "Are you obese?", "Is your gender male or female?", etc. DV: "Have you ever had a stroke?"

Therefore my dependent variable and my independent variables are all dichotomous (binary, coded as 0s and 1s).

My question is: is it appropriate to run a regression to determine which independent variables drive the dependent variable, given that every single one of my variables (both dependent and independent) is dichotomous?

If so, what kind of regression is most appropriate (logistic regression?), and is there anything I should do to make the regression model more accurate?

I have a rudimentary understanding of statistics and regression modelling and would be very grateful if someone could point me in the right direction.


There is 1 best solution below


Logistic regression is indeed one possible way to model your data. It expresses the probability of the event of interest as a function of the independent variables $X$, which can be either categorical (in particular, dichotomous) or continuous. So, $$ P(Y=1|X=x)=\frac{1}{1+e^{-\beta'x}}, $$
such that if $X_1 \in \{1,0\}$, let's say $1$ stands for female and $0$ for male, then the odds ratio of your event of interest for females versus males, holding the other variables fixed, is $e^{\beta_1}$.
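
To make the odds-ratio interpretation concrete, here is a minimal sketch in plain Python. It assumes a single binary predictor and made-up counts (10 strokes among 100 people with $X_1=0$, 20 among 100 with $X_1=1$); the function name `fit_logistic_one_binary` is hypothetical, and the Newton-Raphson loop stands in for whatever routine a statistics package would use. With real data and several predictors you would normally rely on such a package instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_one_binary(x, y, iterations=25):
    """Fit P(Y=1|x) = sigmoid(b0 + b1*x) by Newton-Raphson (illustrative only)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # negative Hessian (information matrix)
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + H^{-1} g, using the 2x2 inverse
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: x = 1 for female, 0 for male; y = 1 if a stroke was reported.
x = [0] * 100 + [1] * 100
y = [1] * 10 + [0] * 90 + [1] * 20 + [0] * 80

b0, b1 = fit_logistic_one_binary(x, y)
odds_ratio_from_model = math.exp(b1)
odds_ratio_from_counts = (20 / 80) / (10 / 90)   # odds for x=1 divided by odds for x=0
print(odds_ratio_from_model, odds_ratio_from_counts)
```

With only one binary predictor the model is saturated, so $e^{\beta_1}$ reproduces the odds ratio computed directly from the counts. With several predictors the coefficient is an adjusted odds ratio and no longer matches the raw table.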

Another possible, and rather simpler, way to analyse this kind of data is to use contingency tables. This approach works well with two, and sometimes three, categorical variables.
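
As a rough illustration of the contingency-table approach, the sketch below cross-tabulates two binary variables using only the standard library. The records and the variable names (`smoker`, `stroke`) are invented for the example and are not taken from the survey in the question.

```python
from collections import Counter

# Hypothetical survey records: (smoker, stroke), each coded 0/1.
records = [(1, 1)] * 20 + [(1, 0)] * 80 + [(0, 1)] * 10 + [(0, 0)] * 90

counts = Counter(records)

# 2x2 table: rows are smoker = 0, 1; columns are stroke = 0, 1.
table = [[counts[(s, d)] for d in (0, 1)] for s in (0, 1)]

print("           stroke=0  stroke=1")
for s, row in zip((0, 1), table):
    print(f"smoker={s}  {row[0]:8d}  {row[1]:8d}")

# Proportion reporting a stroke within each smoking group.
stroke_rate = [row[1] / sum(row) for row in table]
print(stroke_rate)
```

Comparing the within-row proportions is often the quickest way to see whether the two variables appear to be related before fitting any model.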


On the technical side, you have to ensure that your data satisfy the requirements for estimating the model. Another issue you should address is whether you have "enough" data points at each level of every variable and in each outcome class to produce reliable estimates. On the conceptual side, you should ask yourself whether you pose the problem as a probability estimation problem (then logistic regression is a good choice), as a binary classification problem (in this case logistic regression may fit as well, but there are plenty of other classifiers), or just as a dependence/independence problem (in this case, with two binary variables, a simple Pearson's chi-squared test will do the job).
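
For the dependence/independence case, Pearson's chi-squared test on a 2x2 table can be sketched as follows. The counts are invented, the function name `pearson_chi2_2x2` is hypothetical, and no continuity correction is applied, so a statistics package that applies Yates' correction by default may report a slightly different value.

```python
import math

def pearson_chi2_2x2(table):
    """Pearson chi-squared statistic and p-value (1 df) for a 2x2 table, uncorrected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: rows are smoker = 0, 1; columns are stroke = 0, 1.
table = [[90, 10], [80, 20]]
chi2, p_value = pearson_chi2_2x2(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

A small p-value suggests the two variables are not independent, but it says nothing about the size or direction of the association; the odds ratio or the row proportions are needed for that.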