Logistic regression on a large dataset

I need to build a logistic regression model to do some predictions. However, the dataset is very large, consisting of about 500,000 rows.

For example, suppose we want to model whether a person is rich. We code the binary response as 1 for yes and 0 for no, and the independent variables are gender, age, and whether the person holds an MBA degree. In this large dataset, we know for certain that 500 people are rich (say, they have a net worth of more than 1 million); the status of the rest is unknown, so they are coded as 0. I would like to fit a logistic regression on this dataset and then find the people who are predicted as 1 (rich) but recorded as 0 (non-rich) in the data. Is it correct to fit the model on the whole dataset, use that model to predict for the whole group, and set the cutoff value very high to identify people who are actually rich (predicted as 1) but shown as non-rich (0 in the dataset)?
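To make the setup concrete, here is a minimal sketch in Python of the procedure I have in mind. Everything in it is an assumption for illustration: the data is synthetic and much smaller than the real dataset, the coefficients and the 30% labeling rate are made up, the logistic regression is fitted with a hand-rolled gradient descent rather than a statistics package, and "very high cutoff" is taken to mean the top 1% of predicted scores among the rows coded 0.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.5, epochs=200):
    # Batch gradient descent on the log-loss; w[0] is the intercept.
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def predict_proba(w, X):
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]


# Synthetic stand-in for the real data (hypothetical coefficients and rates).
random.seed(0)
N = 3000
X, y, truly_rich = [], [], []
for _ in range(N):
    gender = random.randint(0, 1)
    age_s = (random.uniform(20, 70) - 45.0) / 15.0  # roughly standardized age
    mba = 1 if random.random() < 0.1 else 0
    rich = random.random() < sigmoid(-4.0 + 0.3 * gender + 0.8 * age_s + 2.0 * mba)
    # Only some of the truly rich people are known to be rich (coded 1);
    # everyone else, rich or not, is coded 0.
    known = rich and random.random() < 0.3
    X.append([gender, age_s, mba])
    y.append(1 if known else 0)
    truly_rich.append(rich)

# Fit on the whole dataset, then score the whole dataset with the same model.
w = fit_logistic(X, y)
scores = predict_proba(w, X)

# "Very high cutoff": keep the top 1% of scores among rows coded 0.
unlabeled = [i for i, yi in enumerate(y) if yi == 0]
ranked = sorted(unlabeled, key=lambda i: scores[i], reverse=True)
flagged = ranked[: max(1, len(ranked) // 100)]
cutoff = scores[flagged[-1]]

hits = sum(1 for i in flagged if truly_rich[i])
print("known rich:", sum(y), "flagged:", len(flagged), "cutoff:", round(cutoff, 4))
print("flagged rows that are truly rich (only knowable in a simulation):", hits)
```

I used a cutoff relative to the ranking rather than a fixed probability such as 0.9 because, with so few rows coded 1, the fitted probabilities are all small and a fixed high threshold may flag nobody.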

Any hints on how to deal with a large dataset when building a logistic regression model, especially when only a few observations have the response marked as 1?