I was trying to do exercise 4.14 from Bishop's Pattern Recognition book, and it reads as follows:
Show that for a linearly separable data set, the maximum likelihood solution for the logistic regression model is obtained by finding the vector $w$ whose decision boundary $w^\top \phi(x) = 0$ separates the classes and then taking the magnitude of $w$ to infinity.
First, before I attempted it, I wasn't quite sure what exactly the question wanted me to show. Do we need to show that gradient descent (GD) would increase the weights forever in this case (i.e., to infinity)? Or do we need to show that there is some $w$ that separates the data and that it's fine to take its magnitude to infinity? Or something else?
Intuitively, a hyperplane classifier only checks the sign of the score, so of course we can scale $w$ arbitrarily without changing the sign... but that's true whether or not the data is separable, so I'm not sure what's special when the data actually is separable.
What I find particularly confusing is that the logistic loss (cross-entropy loss)
$$ l(y, w^\top x) = \log\left(1 + e^{-y w^\top x}\right) = - \sum^{\text{classes}}_{c} y_c \log \left[ \operatorname{softmax}\left( h_{w}(x) \right) \right]_c $$
is never zero. So, as I see it, if that's true, then logistic regression should always be updatable via gradient descent, forever, for every data set. Is this true?
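To spell out why I think the loss can never be zero: $e^{-y\, w^\top x} > 0$ for every finite $w$, so each per-example term satisfies
$$ \log\!\left(1 + e^{-y\, w^\top x}\right) > \log(1) = 0 ,$$
i.e. the loss can only approach zero, never reach it.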
Essentially, it seems to me that the total training error
$$J_{\text{train}}(w) = \frac{1}{N} \sum^N_{n=1}\log\left(1+e^{-y^{(n)} w^\top x^{(n)}} \right)$$
approaches zero in the limit (always, for any data set) while $w_t$ itself always diverges to infinity. Is this correct? This seems very counterintuitive, because I also know the loss function is **convex**, so I'd expect every local minimum to be a global minimum, which seems confusing. Does this mean that convex functions can sometimes fail to have a unique minimizer?
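Here is a quick numerical sketch of what I mean (a toy 1-D data set of my own, not from the book), where plain gradient descent keeps driving the loss toward zero while $\|w\|$ keeps growing:

```python
# Toy sketch: gradient descent on a tiny linearly separable data set,
# tracking the logistic loss and ||w|| over time.
import numpy as np

# Two 1-D points with labels y in {-1, +1}; any w > 0 separates them.
X = np.array([[1.0], [-2.0]])
y = np.array([1.0, -1.0])

w = np.zeros(1)
lr = 0.5

def loss(w):
    # Mean logistic loss: (1/N) * sum_n log(1 + exp(-y_n * w^T x_n))
    margins = y * (X @ w)
    return np.mean(np.log1p(np.exp(-margins)))

for t in range(1, 100001):
    margins = y * (X @ w)
    # Gradient of the mean logistic loss w.r.t. w
    grad = -np.mean((y / (1.0 + np.exp(margins)))[:, None] * X, axis=0)
    w -= lr * grad
    if t in (1, 10, 100, 1000, 10000, 100000):
        print(f"step {t:>6}:  ||w|| = {np.linalg.norm(w):7.3f}   loss = {loss(w):.6f}")

# The loss keeps shrinking toward 0 but never reaches it, while ||w|| keeps
# growing -- gradient descent never stops making (ever smaller) updates.
```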
Anyway, does anyone have answers to my questions and a solution to Bishop's exercise?
Note that I did search his solutions manual, but that particular solution is missing!
Your intuition about the separating hyperplane is exactly correct.
From the math perspective, consider the fact that if the data is separable then there is a vector $w$ with $y \cdot w^T x > 0$ for every data point $(x, y)$. We can rewrite this as $y\,||w||_2||x||_2 \cos \theta > 0$, where $\theta$ is the angle between the vectors $x$ and $w$. In order to minimize $\log(1+ e^{-y x^T w})$ we want the exponent to be as negative as possible, i.e. we want $y\,||w||_2||x||_2 \cos \theta$ to be as large and positive as possible.
By separability of the data we know there is some vector $w$ such that $y\,w^T x > 0$ (the exponent is negative) for every data point. How do we make $y\,||w||_2||x||_2 \cos \theta$ as large and positive as possible? We can't change $x$ or $y$, since they are data. We have already fixed $\theta$, since we have a hyperplane separating the data. The only thing left to do is increase $||w||_2$ without bound.
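To make the limit explicit, write $\hat{w}$ for a fixed unit-norm vector whose hyperplane separates the data (my notation), so every margin $y^{(n)}\,\hat{w}^T x^{(n)}$ is strictly positive. Then, in the notation of the question,
$$ J_{\text{train}}(c\,\hat{w}) = \frac{1}{N}\sum_{n=1}^{N} \log\!\left(1 + e^{-c\, y^{(n)}\, \hat{w}^T x^{(n)}}\right) \longrightarrow 0 \quad \text{as } c \to \infty, $$
while $J_{\text{train}}(w) > 0$ for every finite $w$. The infimum of the loss is $0$ but it is never attained, which is exactly Bishop's point: the maximum likelihood "solution" is the separating direction with $||w||_2$ taken to infinity. This also answers the convexity question: the loss is convex, but a convex function need not attain its infimum at any finite point.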