You have trained a logistic regression model for a classification task using an 80-20 train-test split (randomly sampled) on a dataset of 10,000


You have trained a logistic regression model for a classification task using an 80-20 train-test split (randomly sampled) on a dataset of 10,000 observations. The following metrics are computed from the test-set predictions and labels:

| Label | Precision | Recall |
|-------|-----------|--------|
| 0     | 0.91      | 0.94   |
| 1     | 0.10      | 0.07   |

Which of the following is most likely to be true?

a. The classifier is overfitting on the training set.

b. The dataset is imbalanced between classes.

c. The classifier does not have high accuracy.

d. Logistic regression is not an appropriate model choice for this classification problem.

===================================================

I was solving this question and was not sure which one is the answer. I know that a) and c) are not correct, but I am not sure whether it is b) or d). An explanation would be appreciated as well!

Could someone help me with this, please? Thank you!

On BEST ANSWER

With this set-up, keep in mind that the table results come from 2,000 test samples (20% of 10,000), so the metrics are fairly reliable. Now having that in mind:

a) With overfitting, you would expect poor metrics across the whole test set. If that were the case, class 0 would also show lower $P$ and/or $R$, which it does not.

b) The most likely option, and one I have seen happen many times. It can be that relatively few class 1 instances ended up in the training set, so the algorithm simply treats them as noise. At test time there are many more of them, but the model does not know how to recognize them and keeps classifying them as the majority class. It can also happen that almost none are in the training set, reflecting an uneven class distribution out of sample, in which case the model cannot classify them at all.

c) The four numbers in the table are actually enough to pin down an approximate confusion matrix. Write $n_0, n_1$ for the true class counts in the 2,000-sample test set. Recall for class 0 gives $FN_0 = 0.06\,n_0$; precision and recall for class 1 give $FP_1 = \frac{1-0.10}{0.10}\cdot 0.07\,n_1 = 0.63\,n_1$. Every false positive for class 1 is a false negative for class 0, so $0.06\,n_0 = 0.63\,n_1$, i.e. $n_0 = 10.5\,n_1$, giving $n_1 \approx 174$ and $n_0 \approx 1826$. The accuracy is then $A = \frac{TP_0 + TP_1}{2000} \approx \frac{1716 + 12}{2000} \approx 0.86$, which is still high in absolute terms even though the classifier almost never recovers class 1.
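
As a sanity check, one can solve for a confusion matrix consistent with all four reported metrics. This is a sketch in plain Python; it assumes the 2,000-sample test set and the precision/recall values from the table above:

```python
# Solve for a confusion matrix consistent with all four reported metrics,
# assuming a test set of N = 2000 samples (20% of 10,000).
# Notation: n0, n1 = true counts of class 0 and class 1;
# TPc / FPc / FNc = true positives / false positives / false negatives for class c.

N = 2000
P0, R0 = 0.91, 0.94   # class-0 precision / recall
P1, R1 = 0.10, 0.07   # class-1 precision / recall

# Recall_1 = TP1 / n1             -> TP1 = R1 * n1
# Precision_1 = TP1 / (TP1 + FP1) -> FP1 = TP1 * (1 - P1) / P1
# Every false positive for class 1 is a class-0 sample predicted as 1,
# i.e. FP1 = FN0, and Recall_0 gives FN0 = (1 - R0) * n0.
# Combining: (1 - R0) * n0 = R1 * (1 - P1) / P1 * n1
ratio = (R1 * (1 - P1) / P1) / (1 - R0)   # = n0 / n1
n1 = N / (1 + ratio)
n0 = N - n1

TP0 = R0 * n0
TP1 = R1 * n1
accuracy = (TP0 + TP1) / N

print(f"n0 = {n0:.0f}, n1 = {n1:.0f}")   # roughly 1826 vs 174: heavily imbalanced
print(f"accuracy = {accuracy:.3f}")      # roughly 0.864
```

Note that the class split this implies (about 91% vs 9%) also directly supports option b).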

d) Logistic regression can be a perfectly fine model whether the data is balanced or imbalanced. What matters is the features you feed it: if the features are bad, you would not expect high values in any cell of the table, since a noisy dataset follows the "garbage in, garbage out" principle. Other issues (e.g. a bad sampling strategy) can also produce results like these. That is why it is advised to use cross-validation methods.