A data analysis question: finding a model that shows a strong linear relationship and is also useful for prediction and inference

Suppose I have a response (dependent) variable $Y$ and 15 predictor (independent) variables $X_1, X_2, \dots, X_{15}$. I have 30 measurements of the response variable and of each predictor variable, so the sample size is $n=30$.

$\mathbf{The~question~is}$:
How can I find the model exhibiting the strongest simple linear relationship
between the response (dependent) variable and one of the 15 predictor (independent) variables,
such that the model is also useful for prediction and inference on new data?

$\mathbf{My~Thought}$:
I could find the strongest simple linear relationship between the response (dependent) variable and one of the 15 predictor (independent) variables
by running the F-test for each of the 15 predictor variables to see whether a significant linear relationship exists,
and by computing $R^2$ for each predictor variable to see how much of the variation in $Y$ the regression line explains
(the closer $R^2$ is to 1, the better the model fits).
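
For reference, in a simple linear regression with a single predictor the two criteria above are tied together. Writing $SST=\sum_i (y_i-\bar{y})^2$, $SSE=\sum_i (y_i-\hat{y}_i)^2$ and $SSR=SST-SSE$ (this notation is an assumption, it is not used elsewhere in the post), the standard definitions give

$$R^2 = 1-\frac{SSE}{SST}, \qquad F=\frac{SSR/1}{SSE/(n-2)}=\frac{(n-2)\,R^2}{1-R^2}.$$

With $n=30$ this is $F = 28\,R^2/(1-R^2)$, which increases with $R^2$. So when every model is fitted to the same 30 values of $Y$, ranking the 15 predictors by $F$ and ranking them by $R^2$ should give the same order; the F-test adds the significance check, not a different ranking.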

$\mathbf{Problems}$:
Right now I don't know whether these steps are enough to find the model exhibiting the strongest simple linear relationship
between the response (dependent) variable and one of the 15 predictor (independent) variables.

I also don't know what I should do to make sure this model is useful for prediction and inference on new data.

Best answer

First, fit all 15 simple linear regression models, one for each predictor, with $Y$ as the response variable.
Then compute $R^2$ and run the F-test for each of the 15 models to see whether each model is significant.
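
The first two steps can be sketched in plain Python as below. This is only an illustration under stated assumptions: the data are synthetic (generated so that $Y$ depends on $X_3$), and `fit_simple` is a hypothetical helper name, not something from the question. To decide significance, each F statistic would still need to be compared against the $F_{1,\,n-2}$ critical value from a table or statistics package.

```python
import random

def fit_simple(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1, r2, f_stat)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    r2 = 1.0 - sse / sst
    # F statistic with 1 and n-2 degrees of freedom
    f_stat = (sst - sse) / (sse / (n - 2))
    return b0, b1, r2, f_stat

# Synthetic stand-in data (assumption): 30 observations, 15 predictors,
# with Y constructed to depend on X_3 plus noise.
random.seed(0)
n, p = 30, 15
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
Y = [2.0 + 1.5 * X[2][i] + random.gauss(0, 0.5) for i in range(n)]

# Fit one simple regression per predictor and rank by R^2.
results = []
for j in range(p):
    b0, b1, r2, f_stat = fit_simple(X[j], Y)
    results.append((r2, f_stat, j + 1))  # j + 1 is the predictor's index

results.sort(reverse=True)
for r2, f_stat, j in results[:3]:
    print(f"X_{j}: R^2 = {r2:.3f}, F = {f_stat:.2f}")
```
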
Then identify the outliers and high-leverage points in each of the 15 models, and decide whether any of them need to be removed from each model.
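
A minimal sketch of the outlier and leverage check for one model follows. The data are made up for illustration, `influence_diagnostics` is a hypothetical helper name, and the cutoffs (leverage above $4/n$, standardized residual beyond $\pm 2$) are common rules of thumb rather than fixed requirements.

```python
import math

def influence_diagnostics(x, y):
    """Return (leverages, standardized_residuals) for the fit y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Residual standard error on n-2 degrees of freedom
    s = math.sqrt(sum(e * e for e in resid) / (n - 2))
    # Leverage of each point in simple linear regression
    lev = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    std_resid = [e / (s * math.sqrt(1.0 - h)) for e, h in zip(resid, lev)]
    return lev, std_resid

# Made-up illustrative data (assumption): the last point has an extreme
# x value, and the sixth point sits well off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [1.1, 2.0, 2.9, 4.2, 5.1, 9.0, 7.0, 8.1, 8.9, 20.2]

lev, std_resid = influence_diagnostics(x, y)
n = len(x)
# Rule-of-thumb cutoffs (assumptions): leverage > 4/n, |std. residual| > 2
high_leverage = [i for i, h in enumerate(lev) if h > 4.0 / n]
outliers = [i for i, r in enumerate(std_resid) if abs(r) > 2.0]
print("high-leverage indices:", high_leverage)
print("outlier indices:", outliers)
```

Flagged points are candidates for a closer look; whether to remove them is a judgment about the data, not something the cutoffs decide on their own.
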
Then look at the standardized residual plot, the residuals-versus-fitted-values plot, and the normal Q-Q plot for each of the 15 models to determine whether the model violates any of the four assumptions of simple linear regression (linearity, independence of errors, constant variance, and normality of errors). For example, a fanning pattern in the residual plot means the model violates the constant-variance assumption, a curved pattern in the residual plot means it violates the linearity assumption, and points that bend away from the straight line in the Q-Q plot mean it violates the normality assumption.
Then apply a transformation to any model that violates the constant-variance or normality assumption.
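Last, check which simple linear regression model has the highest $R^2$. Note that $R^2$ values are directly comparable only between models that use the same form of the response, so compare them with care if $Y$ was transformed in some models but not in others. Also compute the confidence interval and the prediction interval for the chosen model to show that it is useful for prediction and inference.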
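
The interval step can be sketched as follows, again as an illustration rather than a definitive implementation. The data are synthetic, `intervals_at` is a hypothetical helper name, and the critical value 2.048 is the tabulated 97.5th percentile of the $t$ distribution with $n-2=28$ degrees of freedom (for 95% intervals with $n=30$).

```python
import math
import random

def intervals_at(x, y, x0, t_crit):
    """Return (fit, ci, pi) at x0 for the simple regression of y on x.

    ci is the confidence interval for the mean response at x0,
    pi is the prediction interval for a single new observation at x0.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # residual standard error
    fit = b0 + b1 * x0
    se_mean = s * math.sqrt(1.0 / n + (x0 - x_bar) ** 2 / sxx)
    se_pred = s * math.sqrt(1.0 + 1.0 / n + (x0 - x_bar) ** 2 / sxx)
    ci = (fit - t_crit * se_mean, fit + t_crit * se_mean)
    pi = (fit - t_crit * se_pred, fit + t_crit * se_pred)
    return fit, ci, pi

# Synthetic stand-in data (assumption): n = 30 points around a straight line.
random.seed(1)
x = [float(i) for i in range(30)]
y = [3.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]

T_CRIT_28 = 2.048  # t-table value for 28 df, 95% two-sided
fit, ci, pi = intervals_at(x, y, x0=12.0, t_crit=T_CRIT_28)
print(f"fitted value: {fit:.3f}")
print(f"95% confidence interval: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% prediction interval: ({pi[0]:.3f}, {pi[1]:.3f})")
```

The prediction interval is always wider than the confidence interval at the same $x_0$, because it also accounts for the scatter of a single new observation around the regression line. A narrow prediction interval relative to the range of $Y$ is one way to argue that the model is useful for new data.
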