Stochastic/sequential gradient descent for the perceptron


I'll quote Bishop's Pattern Recognition and Machine Learning (2006) in the context of the perceptron, an algorithm that, given two finite subsets of "patterns" $\mathcal{C}_1,\mathcal{C}_2\subseteq \mathbb{R}^D$, finds a separating hyperplane $\{\mathbf{x} : y(\mathbf{x}) = 0\}$, where $y(\mathbf{x}):=\mathbf{w}^T\mathbf{x}+w_0$, that separates $\mathcal{C}_1$ from $\mathcal{C}_2$ (when one exists):

[...] The perceptron criterion [error function to be minimized] is therefore given by \begin{equation}\tag{4.54} E_P(\mathbf{w}) = -\sum_{n\in\mathcal{M}}\mathbf{w}^T\phi_n t_n \end{equation} where $\mathcal{M}$ denotes the set of misclassified patterns. [...] We now apply the stochastic [or sequential] gradient descent algorithm to this error function. The change in weight [$\mathbf{w}$] is then given by \begin{equation}\tag{4.55} \mathbf{w}^{(\tau +1)}=\mathbf{w}^{(\tau)} - \eta\nabla E_P(\mathbf{w})=\mathbf{w}^{(\tau)} +\eta\phi_n t_n \end{equation} where $\eta$ is the learning rate parameter and $\tau$ is an integer that indexes the steps of the algorithm. [...] Note that, as the weight vector evolves during training, the set of patterns that are misclassified will change.

How does $\nabla E_P(\mathbf{w}) = -\phi_n t_n$?

As Bishop mentions, $\mathcal{M}$ depends on the choice of $\mathbf{w}$, so the gradient doesn't commute with $\sum_{n\in \mathcal{M}}$. Furthermore, what does $n$ mean in this context? In the sum defining $E_P(\mathbf{w})$, $n$ is a summation index, so I don't see how it can legitimately appear outside the sum.

Accepted answer:

Equation 4.55 is very misleading. The paragraph after your excerpt does a better job of explaining the author's intent.

The perceptron learning algorithm has a simple interpretation, as follows. We cycle through the training patterns in turn, and for each pattern $x_n$ we evaluate the perceptron function (4.52). If the pattern is correctly classified, then the weight vector remains unchanged, whereas if it is incorrectly classified, then for class $\mathcal{C}_1$ we add the vector $\phi(x_n)$ onto the current estimate of weight vector $\mathbf{w}$ while for class $\mathcal{C}_2$ we subtract the vector $\phi(x_n)$ from $\mathbf{w}$.
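This cycling procedure can be sketched in a few lines of NumPy. This is an illustrative implementation under my own naming conventions (`perceptron_train`, `Phi`, `t` are not from Bishop); it assumes the feature matrix already includes a bias component and that labels are $t_n \in \{+1, -1\}$, so "add $\phi(x_n)$ for $\mathcal{C}_1$, subtract for $\mathcal{C}_2$" collapses to the single update $\mathbf{w} \mathrel{+}= \eta\, t_n \phi_n$:

```python
import numpy as np

def perceptron_train(Phi, t, eta=1.0, max_epochs=100):
    """Cycle through the patterns; for each misclassified one, add
    eta * t_n * phi_n to the weight vector.

    Phi : (N, D) array of feature vectors (bias feature included).
    t   : (N,) array of labels in {+1, -1}.
    """
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for phi_n, t_n in zip(Phi, t):
            # Misclassified iff t_n * w^T phi_n <= 0: exactly the
            # patterns contributing a term to E_P.
            if t_n * (w @ phi_n) <= 0:
                w += eta * t_n * phi_n  # +phi_n for C_1 (t=+1), -phi_n for C_2 (t=-1)
                mistakes += 1
        if mistakes == 0:  # every pattern classified correctly; done
            break
    return w
```

On linearly separable data the loop terminates with a weight vector satisfying $t_n \mathbf{w}^T\phi_n > 0$ for all $n$ (the classical perceptron convergence theorem).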

In stochastic gradient descent, instead of using the gradient of the entire error function, you choose one of the data points $x_n$ and evaluate the gradient of the single summand of the error function corresponding to that data point. For your $E_P$ function, this gradient is $$\begin{cases}0 & n \notin \mathcal{M} \\ -\phi_n t_n & n \in \mathcal{M}\end{cases}$$

So the update is essentially $\mathbf{w}^{(\tau + 1)} \gets \mathbf{w}^{(\tau)} + \eta\, \phi_n t_n\, \mathbf{1}_{n \in \mathcal{M}}$ (note the plus sign: the step is the *negative* of the gradient above), which matches the plain-English description quoted earlier.
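As a quick numerical check of that single-step update (a sketch; `sgd_step` and the specific numbers are my own, not from Bishop), note that for a misclassified point the margin $t_n \mathbf{w}^T\phi_n$ increases by exactly $\eta\,\|\phi_n\|^2$:

```python
import numpy as np

def sgd_step(w, phi_n, t_n, eta=0.5):
    """One stochastic-gradient step for a single pattern."""
    misclassified = t_n * (w @ phi_n) <= 0        # indicator 1_{n in M}
    return w + eta * t_n * phi_n * misclassified  # step = -eta * gradient

w = np.array([1.0, -1.0])
phi_n = np.array([1.0, 2.0])
t_n = 1.0                     # a C_1 pattern; here t_n * w^T phi_n = -1, so misclassified
w_new = sgd_step(w, phi_n, t_n)
# New margin: t_n * w_new^T phi_n = old margin + eta * ||phi_n||^2 = -1 + 0.5*5 = 1.5
```

A correctly classified point leaves $\mathbf{w}$ unchanged, since the indicator is zero.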

Equation 4.55 is misleading in two ways: (1) it writes the update as if the full gradient $\nabla E_P$ were the step, when the algorithm actually uses the stochastic gradient of a single summand, and (2) the final expression $\mathbf{w}^{(\tau)} + \eta\phi_n t_n$ only applies when $n \in \mathcal{M}$.