I want to build a predictive model that, given a few numeric explanatory variables n1, n2, and n3, predicts a boolean response variable r1. I see a relationship between n1 and r1: as n1 goes up, the probability of r1 being True also goes up. For example, say the following is a representative subset of my training data (ignoring the other explanatory variables n2 and n3).
n1 r1
1 False
10 False
25 False
30 False
37 True
46 True
48 False
52 False
55 True
57 False
60 True
62 False
70 True
80 True
90 True
99 True
It seems like it'd make sense to perform logistic regression to build the predictive model, with n1 as the explanatory variable and r1 as the response variable. n1 could be a useful predictor: as it goes up from 1 to 99, the probability of r1 being True increases.
The problem is that in my dataset, n1 = 0 does not follow this relationship. If the model were valid, n1 = 0 should be a strong indication that r1 is False. As it turns out from the training data, n1 = 0 has no predictive power over r1 at all. For example, say n1 is a measurement of some kind of rate; n1 being 0 could indicate an absence of the measurement rather than a genuinely low rate.
n1 r1
0 False
0 True
0 False
0 True
0 False
0 True
What is the best way to approach this? I feel like just throwing n1 into logistic regression as-is is not a good idea.
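To illustrate the concern, here is a minimal sketch (using scikit-learn, with the toy values above) of what happens if the model is fit on the positive-n1 rows only and then extrapolated to n1 = 0:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Positive-n1 rows from the table above.
n1 = np.array([1, 10, 25, 30, 37, 46, 48, 52, 55, 57,
               60, 62, 70, 80, 90, 99], dtype=float)
r1 = np.array([0, 0, 0, 0, 1, 1, 0, 0, 1, 0,
               1, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(n1.reshape(-1, 1), r1)

# The fitted curve extrapolates to n1 == 0, where it predicts r1 is very
# likely False -- but the n1 == 0 rows are actually a 50/50 split.
p_at_zero = model.predict_proba([[0.0]])[0, 1]
print(p_at_zero)
```

So a single continuous n1 term forces the model to treat 0 as an extreme low value, when it really means "no measurement".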
One way is to add a new indicator variable, valid_n1, which is True if n1 is greater than 0 and False otherwise, and include this new variable in the regression.
Is this a good approach? Should I go back and modify the original n1 field?
EDIT
I would prefer not to remove all entries where n1 is 0, since there is a large number of them.