I’m confused about the steps to go from a simple linear regression to logistic regression.
If we have a dataset consisting of a column of x values and a column of y values (the values we want to predict), then we can run a simple linear regression to get a predictive model such that y_pred = B1x + c, where B1 is the coefficient for our inputs, x, and c is the intercept of the line.
Now let’s say y is categorical: 1 if the event occurs, 0 if it does not. Many of the videos I’ve watched tell me to think of y_pred as a probability even though it isn’t one. Taken literally that makes no sense: assuming the regression line has positive slope, for very large values of x the predicted values grow without bound, and for small enough x they go negative. Neither behavior is compatible with a probability, so we throw linear regression out and try something else.
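A quick numerical check of that point (the slope and intercept here are made up, just to illustrate):

```python
# Hypothetical coefficients from a linear fit; any line with nonzero
# slope produces "probabilities" outside [0, 1] for extreme x.
b1, c = 0.8, -0.3

for x in [-10, 0, 10, 100]:
    y_pred = b1 * x + c
    print(x, y_pred)  # e.g. x = 100 gives y_pred = 79.7, x = -10 gives -8.3
```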
As a next step, they say to rename y_pred “z” and look for a function that takes the z values from our linear model and maps them to values between 0 and 1. A sigmoid does this well and is described as P = 1/(1+e^(-z)). If we now make a new column of data, P, from all our z values, we have a column that tells us the probability of a 1 or 0 based on the independent variable. But to fit the data better we rewrite P = 1/(1+e^(-z)) as ln(P/(1-P)) = z, as the two are equivalent. Then we perform something called maximum likelihood estimation to get a new coefficient for x and a new intercept c so the curve fits the data better... or am I wrong, and you simply do a linear regression because of the linear relationship between the log odds and z?
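Here is what I understand the sigmoid step to do, as a small sketch (the z values are arbitrary):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# However large or negative z is, the output stays strictly between 0 and 1.
for z in [-30.0, -1.0, 0.0, 1.0, 30.0]:
    p = sigmoid(z)
    assert 0.0 < p < 1.0

# And taking log odds of P recovers z, which is why P = 1/(1+e^(-z))
# and ln(P/(1-P)) = z are the same statement:
p = sigmoid(1.3)
print(math.log(p / (1.0 - p)))  # ~1.3
```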
I think my confusion is this: why do we care about log odds? Why not just pass z through the sigmoid, fit it, and use that to get probabilities for varying x values? Am I just so lost that I’m going in circles with misunderstandings? Can someone help with thinking through the steps above?
In a binary classification problem, we are given a training dataset consisting of feature vectors $x_1, \ldots, x_N \in \mathbb R^d$ and corresponding labels. Let's think of the label for example $i$ as being a random variable $Y_i$ with two possible values, $0$ or $1$. Moreover, let's assume that the random variables $Y_i$ are independent and that there exists a vector $\beta^\star \in \mathbb R^{d+1}$ such that $$ P(Y_i = 1) = \sigma(\hat x_i^T \beta^\star) \quad \text{for } i = 1, \ldots, N. $$ Here $\sigma(u) = \frac{1}{1 + e^{-u}}$ is the sigmoid function and $\hat x_i$ is the "augmented" feature vector obtained by prepending a $1$ to the feature vector $x_i$.
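To make the setup concrete, here is a tiny made-up instance with $N = 3$ examples and $d = 2$ features; the feature values and $\beta^\star$ are invented for illustration:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Three feature vectors x_i in R^2 (made-up numbers).
X = np.array([[ 0.5,  1.0],
              [ 2.0, -1.0],
              [-1.5,  0.3]])
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # hat x_i: prepend a 1
beta_star = np.array([0.2, 1.0, -0.5])            # assumed true parameter
p = sigmoid(X_aug @ beta_star)                    # P(Y_i = 1), one per example
print(p)
```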
Let $y_i$ be the observed value of the random variable $Y_i$. Notice that \begin{align} P(Y_i = y_i \text { for } i = 1, \ldots, N) &= \prod_{i=1}^N P(Y_i = y_i) \\ &= \prod_{i=1}^N \sigma(\hat x_i^T \beta^\star)^{y_i}(1 - \sigma(\hat x_i^T \beta^\star))^{1 - y_i}. \end{align} (Parse that last expression carefully. It gives the correct value if $y_i = 0$ and it also gives the correct value if $y_i = 1$. Admittedly, expressing $P(Y_i = y_i)$ in this way is a "slick" thing to do. It's something that you would only think of with the benefit of hindsight, after making a lot of effort to simplify this calculation.)
It seems natural to estimate $\beta^\star$ by finding the vector $\beta$ that maximizes the function $$ L(\beta) = \prod_{i=1}^N \sigma(\hat x_i^T \beta)^{y_i} \left(1 - \sigma(\hat x_i^T \beta)\right)^{1 - y_i}, $$ which is called the "likelihood function". But maximizing $L(\beta)$ is equivalent to maximizing $$ \log L(\beta) = \sum_{i=1}^N y_i \log\left(\sigma(\hat x_i^T \beta) \right) + (1 - y_i) \log \left(1 - \sigma(\hat x_i^T \beta) \right). $$ This is the objective function that we maximize when training a logistic regression model.
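The whole pipeline can be sketched end to end: draw labels from the assumed model, then maximize $\log L(\beta)$ by gradient ascent (its gradient is $\sum_i \hat x_i (y_i - \sigma(\hat x_i^T \beta))$). The data, $\beta^\star$, step size, and iteration count below are all made up for illustration; real implementations use more careful optimizers.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Synthetic data drawn from the assumed model.
rng = np.random.default_rng(0)
N, d = 2000, 2
X = rng.normal(size=(N, d))
X_aug = np.hstack([np.ones((N, 1)), X])           # prepend 1 -> hat x_i
beta_star = np.array([-0.5, 2.0, 1.0])            # "true" parameter
y = (rng.random(N) < sigmoid(X_aug @ beta_star)).astype(float)

# Gradient ascent on the mean log-likelihood.
beta = np.zeros(d + 1)
for _ in range(2000):
    p = sigmoid(X_aug @ beta)                     # current P(Y_i = 1)
    beta += 0.5 * X_aug.T @ (y - p) / N           # gradient step

print(beta)  # lands near beta_star
```

Note that no separate "fit the sigmoid" step appears: the log odds view and the sigmoid view are the same model, and maximum likelihood estimates $\beta$ directly.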