Should I start with PCA and then do multiple regression, or start with multiple regression and then do PCA?


I have a simple but very important question.

Suppose I have a perfectly clean dataset with no missing values. The dataset has many individuals and variables (95 individuals and 10 variables).

I have to do both a multiple regression and a PCA on it.

Should I start with the multiple regression, possibly remove the individuals whose Cook's distance exceeds the threshold, and then run the PCA on the reduced dataset?

Or should I start with the PCA on the complete dataset and do the multiple regression afterwards?

In short, should the order be:

- PCA
- multiple regression

or

- multiple regression
- PCA

Thanks for your help!

There is 1 best solution below

Regression should be the final step, not the first one. With PCA you can reduce the dimension (i.e., the number of explanatory variables) by discarding "unimportant" principal components, that is, those with small variance. You can also use PCA to perform whitening, i.e., to remove the correlation between the explanatory variables and rescale them to equal variance, which avoids multicollinearity in the regression. Note that if you are interested in point prediction or R squared, regressing on the original features yields the same results as regressing on all of the principal components; the results differ only once some components are dropped. In other words, you should use PCA for further reduction and tidying of your data (if possible), and not just for the sake of doing PCA itself.
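The remark about R squared can be checked numerically. Below is a minimal sketch, assuming NumPy and synthetic data with the same shape as in the question (95 individuals, 10 variables). The data, the helper `r_squared`, and the choice of keeping 5 components are illustrative assumptions, not part of the original question. It compares a least-squares fit on the original variables with fits on all and on only the leading principal components, and it also shows whitening of the component scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 95 individuals, 10 variables.
n, p = 95, 10
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)  # two strongly correlated columns
y = X @ rng.normal(size=p) + rng.normal(size=n)


def r_squared(features, target):
    """R^2 of an ordinary least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    centered = target - target.mean()
    return 1.0 - (residuals @ residuals) / (centered @ centered)


# PCA via SVD of the centered data: rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T  # component scores, ordered by decreasing variance

r2_original = r_squared(X, y)             # regression on the raw variables
r2_all_pcs = r_squared(scores, y)         # regression on all 10 components
r2_top_pcs = r_squared(scores[:, :5], y)  # regression on the 5 leading components

# Whitening: rescale each component to unit variance, so the covariance
# matrix of the whitened scores is the identity.
whitened = scores / (s / np.sqrt(n - 1))
cov_whitened = np.cov(whitened, rowvar=False)

print(f"R^2, original variables: {r2_original:.6f}")
print(f"R^2, all components:     {r2_all_pcs:.6f}")
print(f"R^2, first 5 components: {r2_top_pcs:.6f}")
```

The first two R squared values agree up to rounding because keeping every component is only a rotation of the centered data. The third can only be lower or equal, which is the price of the dimension reduction.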