Hi, I am having some trouble with the following:
I have a data set that contains a satisfaction outcome and four predictors: three continuous (age, weight, height) and one factor (graduated high school, yes or no).
In R, I have loaded the data set and set $X_1$ for age, $X_2$ for weight, $X_3$ for the factor and $X_4$ for height.
I want to know if there is evidence that graduating high school has an effect on satisfaction.
But here are some things: I know that I cannot simply look at lm(y~x3), because I need to account for the other predictors as well. So how do I take all of these into account? How many models must I check? What is the general approach to this?
I can run lm on different models, for example the full model, or the model that only excludes x3. Do I just need to look at how the $R^{2}$ values change?
Also, would I need to consider any and all possible interactions? Any advice/general guidelines for this?
This problem can be decomposed into several pieces:
Make a hypothesis about which independent variables strongly affect satisfaction: can you confidently include all four of them? If the answer is yes, you can try the full model with all four X's. If you're not quite confident, you can use a likelihood ratio test to compare nested models that contain different subsets of the variables.
Is the relationship between the outcome and the predictors linear? If the answer is yes (or if you have no further information and, for brevity, only want a linear model), you can try the basic linear model. In R, you can use "glm" to fit the model (with its default Gaussian family it gives the same fit as "lm"), where $X_3$ is a 0/1 indicator for graduating high school: $$Y=\beta_0 + \beta_1 X_1 +\beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \epsilon$$
Goodness of fit: check $R^2$, the t-test for each coefficient, and the F-test for the whole model.
Build confidence intervals for the coefficients, and interpret your results.
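The steps above can be sketched in R roughly as follows. This is only an illustration under assumptions: the real data set is not shown in the thread, so the data frame `dat`, its simulated values, and the effect sizes used to generate `y` are all made up; only the variable names `y`, `x1` to `x4` follow the question.

```r
# Simulated stand-in data (an assumption; replace with the real data set).
set.seed(1)
n  <- 200
x1 <- rnorm(n, mean = 40, sd = 10)                       # age
x2 <- rnorm(n, mean = 70, sd = 12)                       # weight
x3 <- factor(sample(c("no", "yes"), n, replace = TRUE))  # graduated high school
x4 <- rnorm(n, mean = 170, sd = 8)                       # height
y  <- 5 + 0.02 * x1 - 0.01 * x2 + 0.8 * (x3 == "yes") + 0.005 * x4 + rnorm(n)
dat <- data.frame(y, x1, x2, x3, x4)

# Full model, and the same model without the high-school factor.
full    <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
reduced <- lm(y ~ x1 + x2 + x4, data = dat)

# Partial F-test: does x3 explain variation beyond x1, x2 and x4?
anova(reduced, full)

# R^2, the t-test for each coefficient, and the overall F-test.
summary(full)

# 95% confidence intervals; the row "x3yes" is the yes-vs-no difference
# in satisfaction, adjusted for age, weight and height.
confint(full, level = 0.95)
```

Because x3 has only two levels, the partial F-test from `anova(reduced, full)` and the t-test on the `x3yes` row of `summary(full)` give the same p-value (the F statistic is the square of the t statistic), so fitting the full model once is enough to judge the adjusted effect of graduating.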
If you're not satisfied with your model or forecasting power, there are two directions you can try.
In general, you can try all possible models if you want, but keep in mind that $R^2$ is not the only thing you should look at. You need a reasonable model that you can interpret, not a garbage model that happens to have a high $R^2$.
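To see why raw $R^2$ is a poor guide, here is a small sketch that compares three candidate models, including one with an interaction between age and graduating. The data are simulated (an assumption, since the real data are not shown), and using adjusted $R^2$ and AIC as the comparison criteria is one common choice rather than the only one.

```r
# Simulated stand-in data (an assumption; replace with the real data set).
set.seed(2)
n  <- 200
x1 <- rnorm(n, mean = 40, sd = 10)                       # age
x2 <- rnorm(n, mean = 70, sd = 12)                       # weight
x3 <- factor(sample(c("no", "yes"), n, replace = TRUE))  # graduated high school
x4 <- rnorm(n, mean = 170, sd = 8)                       # height
y  <- 5 + 0.02 * x1 - 0.01 * x2 + 0.8 * (x3 == "yes") + 0.005 * x4 + rnorm(n)
dat <- data.frame(y, x1, x2, x3, x4)

# Three nested candidate models.
fits <- list(
  no_x3       = lm(y ~ x1 + x2 + x4, data = dat),
  main        = lm(y ~ x1 + x2 + x3 + x4, data = dat),
  interaction = lm(y ~ x1 * x3 + x2 + x4, data = dat)  # adds the x1:x3 term
)

# Plain R^2 never decreases when a term is added to a nested model,
# so on its own it always favours the largest model.
r2 <- sapply(fits, function(m) summary(m)$r.squared)
r2

# Adjusted R^2 and AIC both penalise extra parameters.
comparison <- data.frame(
  adj_r2 = sapply(fits, function(m) summary(m)$adj.r.squared),
  aic    = sapply(fits, AIC)
)
comparison

# Formal test of whether the interaction term is needed.
anova(fits$main, fits$interaction)
```

An interaction is usually worth testing only when there is a substantive reason to expect the effect of graduating to differ by age, weight or height; testing every possible interaction inflates the chance of a spurious finding.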