How does one show that the maximum likelihood solution for logistic regression has infinite magnitude for separable data (Bishop exercise 4.14)?


I was trying to do exercise 4.14 from Bishop's Pattern Recognition and Machine Learning book, and it reads as follows:

Show that for a linearly separable data set, the maximum likelihood solution for the logistic regression model is obtained by finding a vector $w$ whose decision boundary $w^\top \phi(x) = 0$ separates the classes and then taking the magnitude of $w$ to infinity.

First, before I attempted it, I wasn't quite sure what exactly the question wanted me to show. Do we need to show that gradient descent (GD) would increase the weights forever (to infinity) in this case? Or do we need to show that there is some $w$ that separates the data and that it's fine to take its magnitude to infinity? Or something else?

Intuitively, a hyperplane classifier just checks the sign of $w^\top x$, so of course we can scale $w$ arbitrarily without changing the sign... but that's true whether the data is separable or not, so I am not sure what's special when the data is actually separable.

What I find particularly confusing is that the logistic loss (cross-entropy loss):

$$ \ell(y, w^\top x) = \log\!\left(1 + e^{-y w^\top x}\right) = - \sum_{c \,\in\, \text{classes}} y_c \log \left[ \operatorname{softmax}\!\left( h_{w}(x) \right)_c \right]$$

is never zero. So, if that's true, then it seems to me that logistic regression can always keep being updated via gradient descent, forever, for every data set. Is this true?

Essentially, it seems to me that the total training error:

$$J_{\text{train}}(w) = \frac{1}{N} \sum^N_{n=1}\log\!\left(1+e^{-y^{(n)} w^\top x^{(n)}}\right)$$

has a limit point equal to zero (always, for any data set), but that $\|w_t\|$ always tends to infinity. Is this correct? This seems super counterintuitive, because I also know the loss function is **convex**, so I'd expect every local minimum to be a global minimum, which seems confusing. Does this mean that convex functions can sometimes fail to have a unique minimizer?
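
To make the conjecture concrete, here is a quick sketch one could run (a toy 1-D example of my own, with labels $y^{(n)}\in\{-1,+1\}$ and plain gradient descent; this is only an illustration, not part of the exercise):

```python
import numpy as np

# Toy linearly separable 1-D data with labels in {-1, +1}.
X = np.array([[1.0], [2.0], [-1.0], [-2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def loss_and_grad(w):
    margins = y * (X @ w)                          # y^(n) * w^T x^(n)
    loss = np.mean(np.log1p(np.exp(-margins)))     # mean logistic loss
    # d/dw log(1 + exp(-m)) = -sigmoid(-m) * y * x
    coeff = -1.0 / (1.0 + np.exp(margins))
    grad = (coeff * y) @ X / len(y)
    return loss, grad

w = np.zeros(1)
for _ in range(20000):
    loss, grad = loss_and_grad(w)
    w -= 1.0 * grad                                # fixed step size

print(loss, np.linalg.norm(w))
# The loss keeps shrinking toward 0 while ||w|| keeps growing (roughly
# logarithmically), which is exactly the behaviour I am asking about.
```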

Anyway, does anyone have answers to my questions and a solution to Bishop's exercise?


Note that I did search his solutions manual, but that solution is missing:

http://read.pudn.com/downloads157/ebook/699944/Pattern%20Recognition%20and%20Machine%20Learning%20(Solution%20Manual)%20-%20Bishop.pdf


There are 2 answers below.


Your intuition about the separating hyperplane is exactly correct.

From the math perspective, note that if the data is separable (with labels $y\in\{-1,+1\}$), then there is a vector $w$ with $y\cdot w^Tx > 0$ for every data point $(x,y)$. We can rewrite this as $y\,\|w\|_2\|x\|_2 \cos \theta > 0$, where $\theta$ is the angle between the vectors $x$ and $w$. To minimize $\log(1+ e^{-y x^Tw})$ we want the exponent to be as negative as possible, i.e. we want $y\,\|w\|_2\|x\|_2 \cos \theta$ to be as large and positive as possible.

By separability of the data we know there is some vector $w$ such that $y\, x^Tw$ is positive (so the exponent is negative) for every data point. How do we make $y\,\|w\|_2\|x\|_2 \cos \theta$ as large and positive as possible? We can't change $x$ or $y$, since they are data. We have already fixed $\theta$, since we chose a hyperplane that separates the data. The only thing left is to increase $\|w\|_2$ without bound.
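
A minimal sketch of this argument (the 2-D toy data below is my own, with labels $y\in\{-1,+1\}$): fix a separating direction and only grow $\|w\|_2$, and the loss keeps dropping toward $0$.

```python
import numpy as np

# Toy separable data: any w whose first coordinate is positive separates it.
X = np.array([[ 1.0,  0.5],
              [ 2.0, -0.3],
              [-1.0,  0.2],
              [-1.5, -0.4]])
y = np.array([1, 1, -1, -1])

def total_logistic_loss(w):
    # sum_n log(1 + exp(-y_n * w^T x_n))
    return np.sum(np.log1p(np.exp(-y * (X @ w))))

w = np.array([1.0, 0.0])   # separating direction: y_n * w^T x_n > 0 for all n
for scale in [1.0, 10.0, 100.0, 1000.0]:
    print(scale, total_logistic_loss(scale * w))
# The loss decreases toward 0 as the scale grows, so the infimum is only
# approached in the limit ||w||_2 -> infinity; no finite w attains it.
```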


The exercise is asking the reader to prove the following iff relation:

Given a linearly separable data set $\{{\bf\phi}_n,t_n\}$: ${\bf w}$ is the maximum likelihood solution $\Leftrightarrow$ the decision boundary ${\bf w}^T{\bf\phi}=0$ separates the classes and $|{\bf w}|\to\infty$.

Next I prove it. In Bishop's book, most exercises are closely related to the text and are usually marked in the margin. This exercise, 4.14, is no exception: it shows up on page 206 in section 4.3.2, so I will use that part of the text as the context for the proof.

Some notation and setup. The context restricts the problem to two-class classification. The linearly separable data set is $\{{\bf\phi}_n,t_n\}$, where $t_n\in\{0,1\}$ and ${\bf\phi}_n={\bf\phi}({\bf x}_n)$, with $n=1,\ldots,N$. As specified in section 4.2.2, $t_n=1$ denotes class $\mathcal C_1$ and $t_n=0$ denotes class $\mathcal C_2$. In the logistic regression model, the posterior probability of class $\mathcal C_1$ is a logistic sigmoid $$p(\mathcal C_1|{\bf\phi})=y({\bf\phi})=\sigma({\bf w}^T{\bf\phi}),\tag{4.87}$$ so the target variable $t_n$ follows a Bernoulli distribution (2.2) conditioned on the model parameter $\bf w$ as well as the input ${\bf\phi}_n$: $$p(t_n|{\bf w})=y_n^{t_n}(1-y_n)^{1-t_n},$$ where the input is dropped as a notational convention (page 141). Here $y_n=p(\mathcal C_1|{\bf\phi}_n)=\sigma({\bf w}^T{\bf\phi}_n)$ is the model's estimate of the probability that the sample belongs to class $\mathcal C_1$. From this probability, the likelihood function can be written $$p({\bf t}|{\bf w})=\prod\limits_{n = 1}^N y_n^{t_n}(1-y_n)^{1-t_n},\tag{4.89}$$ where ${\bf t}=(t_1,\ldots,t_N)^T$.

As usual, maximization of the likelihood function is equivalent to minimizing a properly defined error function, here the cross-entropy error function in the form $$E({\bf w})=-\ln p({\bf t}|{\bf w})=-\sum\limits_{n = 1}^N\{t_n\ln y_n+(1-t_n)\ln(1-y_n)\}.\tag{4.90}$$
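
As a quick sanity check that (4.89) and (4.90) describe the same quantity, here is a small numerical sketch (the toy $\{{\bf\phi}_n, t_n\}$ and the particular $\bf w$ below are my own, purely for illustration):

```python
import numpy as np

# Toy data in Bishop's notation: rows of Phi are phi_n, targets t_n in {0, 1}.
Phi = np.array([[ 1.0,  0.5],
                [ 2.0, -0.3],
                [-1.0,  0.2],
                [-1.5, -0.4]])
t = np.array([1, 1, 0, 0])

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([0.7, 0.1])
y = sigmoid(Phi @ w)                                   # y_n = sigma(w^T phi_n), (4.87)

likelihood = np.prod(y**t * (1 - y)**(1 - t))          # p(t|w), (4.89)
E = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))   # cross-entropy, (4.90)

print(np.isclose(E, -np.log(likelihood)))              # True: E(w) = -ln p(t|w)
```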

According to the decision theory introduced on page 39 in section 1.5, we assign the class with the larger posterior probability. So the decision boundary is determined by $\sigma=0.5$, or equivalently ${\bf w}^T{\bf\phi}=0$. If ${\bf w}^T{\bf\phi}>0$, then $\sigma>0.5$ and $p(\mathcal C_1|{\bf\phi})>p(\mathcal C_2|{\bf\phi})$, so we decide that the new value is in $\mathcal C_1$. Similarly, if ${\bf w}^T{\bf\phi}<0$, we classify the new value as $\mathcal C_2$, because that class has the larger posterior probability.

The proof has nothing to do with concrete optimization algorithms like gradient descent; we only need to reason about the minimization itself. When ${\bf w}$ is the maximum likelihood solution, it by definition achieves the minimum of the right-hand side of (4.90). As a result, under this ${\bf w}$, every classification decision must be correct. That is, for samples of class $\mathcal C_1$, for which $t_n=1$, we must have ${\bf w}^T{\bf\phi}_n>0$, i.e. $y_n>0.5$; for the remaining samples, which belong to $\mathcal C_2$ and have $t_n=0$, we must have ${\bf w}^T{\bf\phi}_n<0$, i.e. $y_n<0.5$. Otherwise the error would only increase, violating the definition of the maximum likelihood solution. To see this, note that $y_n$, being a value of the logistic sigmoid, lies in the interval $(0,1)$. If a sample is classified incorrectly, say $t_n=1$ but ${\bf w}^T{\bf\phi}_n<0$ for some $n$, then $y_n<0.5$ and the term $-\ln y_n$ contributes a large error. However, since the data set is linearly separable, by definition there is a ${\bf w}$ that makes ${\bf w}^T{\bf\phi}_n>0$, so the model makes the correct decision and the error term $-\ln y_n$ is reduced, since this $y_n$ can be pushed close to $1$. This contradicts the assumption that the original ${\bf w}$ was the maximum likelihood solution. The case $t_n=0$ is analyzed likewise. So the decision boundary ${\bf w}^T{\bf\phi}({\bf x})=0$ separates the classes (i.e., all classification decisions are correct).

As for the second part of the necessary condition, namely $|{\bf w}|\to\infty$, we can show it by observing that ${\bf w}^T{\bf\phi}_n$ cannot be zero when $\bf w$ is the maximum likelihood solution, because otherwise we could easily find another $\bf w$ that makes ${\bf w}^T{\bf\phi}_n$ nonzero and achieves a smaller error. When ${\bf w}^T{\bf\phi}_n$ is nonzero (it can be either positive or negative), its absolute value increases as the magnitude of $\bf w$ increases, because ${\bf\phi}_n$ is constant. From the shape of the logistic sigmoid we can see that, as a consequence, both log terms in (4.90) vanish if and only if $|{\bf w}|\to\infty$, giving the smallest possible value (namely $0$) of the error.
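
To make the last step explicit, write ${\bf w}=s\hat{\bf w}$ with $\|\hat{\bf w}\|=1$ a direction whose boundary separates the classes, and let $a_n=\hat{\bf w}^T{\bf\phi}_n$. For a sample with $t_n=1$ (so $a_n>0$) the error term is $$-\ln y_n=-\ln\sigma(s\,a_n)=\ln\!\left(1+e^{-s\,a_n}\right)\;\longrightarrow\;0\quad\text{as }s\to\infty,$$ and for a sample with $t_n=0$ (so $a_n<0$) it is $$-\ln(1-y_n)=\ln\!\left(1+e^{s\,a_n}\right)\;\longrightarrow\;0\quad\text{as }s\to\infty.$$ For any finite $s$ each term is strictly positive, so the infimum $E({\bf w})=0$ is approached only as $|{\bf w}|\to\infty$.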

All of the above argument can be reversed, which proves the iff statement of the exercise.