Simple Logistic Regression - how do I use real data?


Binomial Logistic Regression to predict probability

Confusion Point 1:

I think I'm right in saying one of the steps of Logistic Regression is to get:

$$\log(\mathrm{Odds})$$

Now take this very simple example: I want to predict whether somebody is a parent based on their age. My data set has 7 training entries: each person's Age, and whether they are a parent or not:

--------------------------------------    
Age   Parent?   P(parent|age)  log(Odds)
--------------------------------------
15.0  0         0              -∞
20.8  0         0              -∞
22.2  1         1              +∞
28.4  0         0              -∞
33.1  1         1              +∞
40.9  1         1              +∞
48.7  1         1              +∞
--------------------------------------

Does $\log(\mathrm{Odds})$ have to be calculated for each of the $n$ entries you have of the independent variable, $x$ (7 in this example)? My particular example only gives me $\log(\mathrm{Odds})$ values of $+\infty$ and $-\infty$. This surely cannot be correct. How can I fit a line to this? Does this mean I have to start binning data into groups? Surely, if I start binning into groups, the Age independent variable is no longer continuous---does that matter?

Confusion Point 2:

How are the coefficients, $\beta_0$ and $\beta_1$ found?

Once one has $\log(\mathrm{Odds})$ as a function of the independent variable, I think it is safe to assume to have the form:

$$y=mx+c$$ $$\log(\mathrm{Odds})=\beta_1 x + \beta_0$$

where $\beta_0$ is analogous to the y-intercept, $c$, and $\beta_1$ is analogous to the gradient, $m$. That is, after all, a major driving factor of taking the log of the Odds, right?---it maps the probability to a continuous number between $-\infty$ and $+\infty$.

Would one then minimise the squared differences between the observed and modelled log-odds, much as in Ordinary Least Squares fitting (e.g., analytically, by gradient descent, etc.), to find $\beta_0$ and $\beta_1$?

Best Answer:

1) It seems like there is some confusion between the "real" output of a logistic regression and the use of that output for a classification task. The idea behind the model is to estimate the conditional probability of some event of interest (parent or not, in your case) given some independent variables (age), namely

$$ \hat{P}(y_i=1|x=x_i)=\frac{e^{x_i'\hat{\beta}}}{1+e^{x_i'\hat{\beta}}}, $$
is the estimated probability that the $i$th subject is a parent $\{y_i=1\}$, given their age $x_i$ and the estimated coefficients $\hat{\beta}$. As such, for every individual you get a probability $p_i\in(0,1)$, so in order to use these values for classification you have to set some cut-off $c$ and classify according to $$ \mathcal{I}\{p_i>c\}. $$ That is, if $p_i>c$ then give the label "parent" to this individual. Thus, the $\log(\mathrm{Odds})$ values in your table were computed after this dichotomization (instead of before), which is why they are all $\pm\infty$.
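To make the probability-then-cutoff flow concrete, here is a minimal Python sketch. The coefficients below are made-up values for illustration only (they are not fitted to the data in the question):

```python
import math

# Hypothetical coefficients (beta0, beta1) -- assumed for illustration,
# NOT estimated from the data in the question.
beta0, beta1 = -8.0, 0.3

ages = [15.0, 20.8, 22.2, 28.4, 33.1, 40.9, 48.7]

def predicted_prob(age):
    """Estimated P(parent | age) from the logistic model."""
    z = beta0 + beta1 * age               # linear predictor x'beta
    return math.exp(z) / (1.0 + math.exp(z))

cutoff = 0.5                              # the cut-off c
for age in ages:
    p = predicted_prob(age)               # p_i in (0, 1)
    label = 1 if p > cutoff else 0        # I{p_i > c}
    print(f"age={age:4.1f}  p={p:.3f}  label={label}")
```

Note that every $p_i$ lands strictly inside $(0,1)$; the $\pm\infty$ log-odds only appear if you take odds of the 0/1 labels after thresholding.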

2) Mostly they are the maximum likelihood estimators. Because the logistic model is non-linear, the optimization is done numerically. The likelihood is $$ \mathcal{L}(\beta) = \prod_{i=1}^n p(x_i)^{y_i}\left(1-p(x_i)\right)^{1-y_i} $$
or equivalently $$ \ell(\beta) = \log\left(\prod_{i=1}^n p(x_i)^{y_i}\left(1-p(x_i)\right)^{1-y_i}\right), $$
after some algebra the maximization problem becomes $$ \ell(\beta)=-\sum_{i=1}^n\log\left(1+e^{x_i'\beta}\right)+\sum_{i=1}^n y_i x_i'\beta, $$ which has to be solved numerically (e.g., by the Newton-Raphson method).
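As a sketch of that numerical maximization, here is a plain-Python Newton-Raphson fit of $(\beta_0, \beta_1)$ to the 7 points from the question. The update $\beta \leftarrow \beta + H^{-1}g$ uses the gradient $g=\sum_i (y_i-p_i)x_i$ and the observed information $H=\sum_i p_i(1-p_i)x_i x_i'$ of the log-likelihood above; the starting point, tolerance, and iteration cap are arbitrary choices:

```python
import math

# Data from the question: age and parent indicator
ages    = [15.0, 20.8, 22.2, 28.4, 33.1, 40.9, 48.7]
parents = [0, 0, 1, 0, 1, 1, 1]

def sigmoid(z):
    """Numerically stable logistic function e^z / (1 + e^z)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

b0, b1 = 0.0, 0.0                     # start at beta = (0, 0)
for _ in range(50):
    # Gradient of the log-likelihood: sum_i (y_i - p_i) * (1, x_i)
    g0 = g1 = 0.0
    # Observed information: sum_i p_i (1 - p_i) * (1, x_i)(1, x_i)'
    h00 = h01 = h11 = 0.0
    for x, y in zip(ages, parents):
        p = sigmoid(b0 + b1 * x)
        w = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    # Solve the 2x2 system H * step = g by Cramer's rule
    det = h00 * h11 - h01 * h01
    step0 = (h11 * g0 - h01 * g1) / det
    step1 = (h00 * g1 - h01 * g0) / det
    b0 += step0
    b1 += step1
    if abs(step0) < 1e-10 and abs(step1) < 1e-10:
        break

print(f"beta0 = {b0:.3f}, beta1 = {b1:.3f}")
```

In practice you would let a library do this (e.g. a GLM routine with a logit link), but the loop above is the same Newton iteration those routines run under the hood.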


1) Let's take your example. Let's assume that you have estimated the coefficients $\beta$ (for now it doesn't matter how). For the sake of simplicity assume that $(\hat{\beta}_0, \hat{\beta}_1)=(1, 0.01)$. Now, you want to classify each subject using the following logistic model.

$$ \hat{P}(y_i=1|age)=\frac{e^{\beta_0+\beta_1age}}{1+e^{\beta_0+\beta_1age}}. $$

Let's take the first subject. His/her age is $15$, so by plugging it into the model we get $$ \hat{P}(y_i=1|age=15)=\frac{e^{1+0.01\cdot15}}{1+e^{1+0.01\cdot15}}\approx 0.76, $$ which means that his/her probability of being a parent is $0.76$, and the $\log(\mathrm{Odds})$ is $$ \log\left(0.76/(1-0.76)\right)=1+0.01\cdot 15 = 1.15. $$ Now, you are interested in using this model to perform classification rather than to compute probabilities. As such, you need to set a rule that converts probabilities into classifications. The default rule (which, in a sense, assumes no prior knowledge) is: if $\hat{P}(y_i=1|age_i) > 1/2$ then give the label "parent" to subject $i$. Namely, the first subject, with $p_i=0.76$, is classified as a parent. If you compute the log of the odds from the classified value $\{1,0\}$ you'll get either $+\infty$ for "parent" or $-\infty$ for "non-parent".
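The arithmetic in this worked example can be checked with a few lines of Python, using the same assumed coefficients $(1, 0.01)$ and the default $1/2$ cut-off:

```python
import math

# Assumed coefficients from the worked example above
b0, b1 = 1.0, 0.01

def p_parent(age):
    """Estimated P(parent | age) under the assumed model."""
    z = b0 + b1 * age
    return math.exp(z) / (1.0 + math.exp(z))

p = p_parent(15.0)                       # the first subject, age 15
log_odds = math.log(p / (1.0 - p))       # equals b0 + b1*15 = 1.15
label = "parent" if p > 0.5 else "non-parent"
print(f"p = {p:.2f}, log-odds = {log_odds:.2f}, label = {label}")
# prints: p = 0.76, log-odds = 1.15, label = parent
```

The log-odds of the *probability* is the finite linear predictor $1.15$; it only becomes $\pm\infty$ if you take the odds of the hard 0/1 label instead.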