I am applying the KS test (as a goodness-of-fit test) to a logistic regression model (classes: 0, 1), where the x-axis is the predicted probability of class 1, sorted in ascending order. I have 2 questions:
1. Why are the 2 plotted curves TPR and FPR? AFAIK, shouldn't the two curves usually be the CDFs of the 2 classes w.r.t. different thresholds on the x-axis?
2. Why is the KS value = max(TPR-FPR)? According to (1), these two rates would ignore the counts for TP and TN, right? If max(TPR-FPR) holds, what is the proof or derivation behind it?
I have been stuck on this for quite a while, as googling turns up very little explanation. Any help, please? Thanks in advance!
The K-S test is generally used to check whether two distributions are equal by comparing their CDFs: the K-S statistic is the largest vertical distance between the two CDFs. A low K-S value means there is little evidence that the distributions differ. Conversely, a high K-S value indicates that the two distributions are distinguishable.
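For testing goodness of fit for logistic regression, the K-S test is done on TPR and FPR, which are the two class CDFs in disguise. Let $F_1$ and $F_0$ be the empirical CDFs of the predicted probability for the actual class 1 and class 0 examples. If an example is predicted as class 1 when its score is at least the threshold $t$, then $TPR(t) = 1 - F_1(t)$ and $FPR(t) = 1 - F_0(t)$ (up to how ties exactly at the threshold are counted), so $TPR(t) - FPR(t) = F_0(t) - F_1(t)$. Both rates are normalized by the size of their own class, which is exactly what a CDF does, so no counts are ignored. The main idea is to achieve a large separation between these two curves. We can then pick the probability threshold that corresponds to the maximum separation. If the model separates the two classes perfectly, its K-S value equals 1.
$$KS_{value} = \max_t \left( TPR(t) - FPR(t) \right)$$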
Please refer to this descriptive image.
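To make the link between the two views concrete, here is a minimal sketch in plain Python. The scores are made-up toy values, and the function names (`rate_at_or_above`, `cdf_below`, `ks_from_rates`, `ks_from_cdfs`) are hypothetical helpers, not part of any library. It assumes an example is predicted as class 1 when its score is at least the threshold, and it computes the one-sided gap; the general two-sample K-S statistic takes the absolute difference.

```python
def rate_at_or_above(scores, t):
    """Fraction of scores >= t, i.e. predicted as class 1 at threshold t."""
    return sum(1 for s in scores if s >= t) / len(scores)


def cdf_below(scores, t):
    """Empirical CDF just below t: fraction of scores strictly less than t."""
    return sum(1 for s in scores if s < t) / len(scores)


def ks_from_rates(pos_scores, neg_scores):
    """Return (max over thresholds of TPR - FPR, first threshold reaching it)."""
    thresholds = sorted(set(pos_scores) | set(neg_scores))
    best_gap, best_t = 0.0, None
    for t in thresholds:
        tpr = rate_at_or_above(pos_scores, t)  # actual class 1 scored >= t
        fpr = rate_at_or_above(neg_scores, t)  # actual class 0 scored >= t
        if tpr - fpr > best_gap:
            best_gap, best_t = tpr - fpr, t
    return best_gap, best_t


def ks_from_cdfs(pos_scores, neg_scores):
    """Return max over thresholds of F_0(t) - F_1(t), using the class CDFs."""
    thresholds = sorted(set(pos_scores) | set(neg_scores))
    return max(cdf_below(neg_scores, t) - cdf_below(pos_scores, t)
               for t in thresholds)


# Toy predicted probabilities of class 1 (assumed values, for illustration only)
pos_scores = [0.9, 0.8, 0.7, 0.4]  # examples whose actual class is 1
neg_scores = [0.6, 0.3, 0.2, 0.1]  # examples whose actual class is 0

ks_rates, best_threshold = ks_from_rates(pos_scores, neg_scores)
ks_cdfs = ks_from_cdfs(pos_scores, neg_scores)
print(ks_rates, best_threshold, ks_cdfs)
```

On this toy data both routes give the same K-S value, because TPR and FPR are just one minus the class 1 and class 0 CDFs, so their difference is the gap between the two CDFs.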