Binomial Logistic Regression to predict probability
Confusion Point 1:
I think I'm right in saying one of the steps of Logistic Regression is to get:
$$\log(\mathrm{Odds})$$
Now take this very simple example: I want to predict whether somebody is a parent based on their age. My data set has 7 training entries: the Age of each person, and whether or not they are a parent:
| Age  | Parent? | P(parent\|age) | log(Odds) |
|------|---------|----------------|-----------|
| 15.0 | 0       | 0              | -∞        |
| 20.8 | 0       | 0              | -∞        |
| 22.2 | 1       | 1              | +∞        |
| 28.4 | 0       | 0              | -∞        |
| 33.1 | 1       | 1              | +∞        |
| 40.9 | 1       | 1              | +∞        |
| 48.7 | 1       | 1              | +∞        |
Does $\log(\mathrm{Odds})$ have to be calculated for each of the $n$ entries you have of the independent variable, $x$ (7 in this example)? My particular example only gives me $\log(\mathrm{Odds})$ values of $+\infty$ and $-\infty$. This surely cannot be correct. How can I fit a line to this? Does this mean I have to start binning data into groups? Surely, if I start binning into groups, the Age independent variable is no longer continuous---does that matter?
Confusion Point 2:
How are the coefficients, $\beta_0$ and $\beta_1$ found?
Once one has $\log(\mathrm{Odds})$ as a function of the independent variable, I think it is safe to assume it has the form:
$$y=mx+c$$ $$\log(\mathrm{Odds})=\beta_1 x + \beta_0$$
where $\beta_0$ is analogous to the y-intercept, $c$, and $\beta_1$ is analogous to the gradient, $m$. That is, after all, a major driving factor of taking the log of the Odds, right?---it produces a continuous number between $-\infty$ and $+\infty$.
Would one then minimise the squared differences between the observed and modelled $\log(\mathrm{Odds})$, much like in Ordinary Least Squares fitting (e.g., analytically, by gradient descent, etc.), to find $\beta_0$ and $\beta_1$?
1) It seems like there is some confusion between the "real" output of a logistic regression and the use of this output for a classification task. The idea behind the model is to estimate the conditional probability of some event of interest (parent or not, in your case) given some independent variables (age), namely
$$ \hat{P}(y_i=1|x=x_i)=\frac{e^{x_i'\hat{\beta}}}{1+e^{x_i'\hat{\beta}}}, $$
which is the estimated probability that the $i$th subject is a parent $\{y_i=1\}$, given their age $x_i$ and the estimated coefficients $\hat{\beta}$. As such, for every individual you get a probability $p_i\in(0,1)$, so in order to use these values for classification you have to set some cut-off $c$, such that $$ \mathcal{I}\{p_i>c\}. $$ That is, if $p_i>c$ then give the label "parent" to this individual. Thus, the $\log(\mathrm{odds})$ in your table were computed after the dichotomization (instead of before), which is why you only see $\pm\infty$.
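To make the order of operations concrete, here is a minimal sketch in Python (scikit-learn is just one possible tool, and the variable names are mine): the model is fitted directly to the raw 0/1 labels, each subject gets a probability in $(0,1)$, and only then is a cut-off applied.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The seven training entries from the question
ages = np.array([15.0, 20.8, 22.2, 28.4, 33.1, 40.9, 48.7]).reshape(-1, 1)
parent = np.array([0, 0, 1, 0, 1, 1, 1])

# Fit on the raw 0/1 labels; no per-observation log(odds) is computed beforehand
model = LogisticRegression(C=1e9)  # very large C ~ effectively no regularisation
model.fit(ages, parent)

p_hat = model.predict_proba(ages)[:, 1]   # estimated P(parent | age), in (0, 1)
labels = (p_hat > 0.5).astype(int)        # dichotomise with a cut-off c = 0.5

for age, p, lab in zip(ages.ravel(), p_hat, labels):
    print(f"age={age:5.1f}  P(parent|age)={p:.3f}  label={lab}")
```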
2) Usually they are the maximum likelihood estimates. Because the logistic model is a non-linear model, the optimization is done numerically. The likelihood is $$ \mathcal{L}(\beta) = \prod_{i=1}^n p(x_i)^{y_i}\left(1-p(x_i)\right)^{1-y_i}, $$
or equivalently the log-likelihood $$ \ell(\beta) = \log\left(\prod_{i=1}^n p(x_i)^{y_i}\left(1-p(x_i)\right)^{1-y_i}\right). $$
After some algebra the maximization problem becomes $$ \ell(\beta)=-\sum_{i=1}^n\log\left(1+e^{x_i'\beta}\right)+\sum_{i=1}^n y_i\, x_i'\beta, $$ which has to be solved numerically (e.g., by the Newton-Raphson method).
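Here is a rough sketch of that fit done "by hand" on the seven points from the question, handing the negative of $\ell(\beta)$ to a generic numerical optimiser (scipy's BFGS here rather than Newton-Raphson, but the idea is the same); the function and variable names are mine.

```python
import numpy as np
from scipy.optimize import minimize

ages = np.array([15.0, 20.8, 22.2, 28.4, 33.1, 40.9, 48.7])
y = np.array([0, 0, 1, 0, 1, 1, 1])
X = np.column_stack([np.ones_like(ages), ages])   # design matrix: intercept + age

def neg_log_lik(beta):
    eta = X @ beta                                 # x_i' beta for every subject
    # negative of l(beta) = sum log(1 + e^eta) - sum y_i * eta
    return np.sum(np.log1p(np.exp(eta))) - np.sum(y * eta)

fit = minimize(neg_log_lik, x0=np.zeros(2), method="BFGS")
beta0_hat, beta1_hat = fit.x
print(beta0_hat, beta1_hat)   # the maximum likelihood estimates
```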
1) Let's take your example. Let's assume that you have estimated the coefficients $\hat{\beta}$ (for now it doesn't matter how this was done). For the sake of simplicity, assume that $(\hat{\beta}_0, \hat{\beta}_1)=(1,0.01)$. Now, you want to classify each subject using the following logistic model:
$$ \hat{P}(y_i=1|\mathrm{age})=\frac{e^{\hat{\beta}_0+\hat{\beta}_1\cdot\mathrm{age}}}{1+e^{\hat{\beta}_0+\hat{\beta}_1\cdot\mathrm{age}}}. $$
Let's take the first subject. His/her age is $15$, so by plugging it into the model we get $$ \hat{P}(y_i=1|\mathrm{age}=15)=\frac{e^{1+0.01\cdot15}}{1+e^{1+0.01\cdot15}}\approx 0.76, $$ which means that his/her probability of being a parent is $0.76$, and the $\log(\mathrm{odds})$ is $$ \log\left(0.76/(1-0.76)\right)=1+0.01\cdot 15. $$ Now, you are interested in using this model to perform classification rather than to compute probabilities. As such, you need to set a rule that converts probabilities into classifications. The default rule (which, in a sense, assumes no prior knowledge) is: if $p(y_i|\mathrm{age}_i) >1/2$ then give the label "parent" to subject $i$. Namely, the first one, with $p_i=0.76$, is classified as a parent. If you compute the log of the odds of the classified values $\{1,0\}$ you'll get $+\infty$ for "parent" and $-\infty$ for "non-parent".
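If you want to check the arithmetic for all seven subjects at once, here is a small sketch that plugs the assumed coefficients $(1, 0.01)$ into the model and applies the $1/2$ rule; note these coefficients are made up for illustration, so every subject happens to get the label "parent".

```python
import numpy as np

# Assumed (not fitted) coefficients from the worked example above
beta0, beta1 = 1.0, 0.01
ages = np.array([15.0, 20.8, 22.2, 28.4, 33.1, 40.9, 48.7])

log_odds = beta0 + beta1 * ages             # the linear predictor = log(odds)
p_hat = 1.0 / (1.0 + np.exp(-log_odds))     # inverse logit gives P(parent | age)
labels = (p_hat > 0.5).astype(int)          # the default 1/2 cut-off rule

print(p_hat.round(2))   # first entry is ~0.76, matching the calculation above
print(labels)           # with these toy coefficients everyone is labelled 1
```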