Stochastic/sequential gradient descent for the perceptron


I'll quote Bishop's Pattern Recognition and Machine Learning (2006) in the context of the perceptron, an algorithm that, given two finite subsets of "patterns" $\mathcal{C}_1,\mathcal{C}_2\subseteq \mathbb{R}^D$, finds a separating hyperplane $\{\mathbf{x} : y(\mathbf{x}) = 0\}$, where $y(\mathbf{x}):=\mathbf{w}^T\mathbf{x}+w_0$, that separates $\mathcal{C}_1$ from $\mathcal{C}_2$ (when one exists):

[...] The perceptron criterion [error function to be minimized] is therefore given by \begin{equation}\tag{4.54} E_P(\mathbf{w}) = -\sum_{n\in\mathcal{M}}\mathbf{w}^T\phi_n t_n \end{equation} where $\mathcal{M}$ denotes the set of misclassified patterns. [...] We now apply the stochastic [or sequential] gradient descent algorithm to this error function. The change in weight [$\mathbf{w}$] is then given by \begin{equation}\tag{4.55} \mathbf{w}^{(\tau +1)}=\mathbf{w}^{(\tau)} - \eta\nabla E_P(\mathbf{w})=\mathbf{w}^{(\tau)} +\eta\phi_n t_n \end{equation} where $\eta$ is the learning rate parameter and $\tau$ is an integer that indexes the steps of the algorithm. [...] Note that, as the weight vector evolves during training, the set of patterns that are misclassified will change.

How does $\nabla E_P(\mathbf{w}) = -\phi_n t_n$?

As Bishop mentions, $\mathcal{M}$ depends on the choice of $\mathbf{w}$, so the gradient doesn't commute with $\sum_{n\in \mathcal{M}}$. Furthermore, what does $n$ mean in this context? In the sum defining $E_P(\mathbf{w})$, $n$ is a summation index, so I don't see how it can legitimately appear outside the sum.

Accepted answer:

Equation 4.55 is very misleading. The paragraph after your excerpt does a better job of explaining the author's intent.

The perceptron learning algorithm has a simple interpretation, as follows. We cycle through the training patterns in turn, and for each pattern $x_n$ we evaluate the perceptron function (4.52). If the pattern is correctly classified, then the weight vector remains unchanged, whereas if it is incorrectly classified, then for class $\mathcal{C}_1$ we add the vector $\phi(x_n)$ onto the current estimate of weight vector $\mathbf{w}$ while for class $\mathcal{C}_2$ we subtract the vector $\phi(x_n)$ from $\mathbf{w}$.
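This cycling procedure can be sketched in a few lines of NumPy. This is an illustrative implementation under my own naming conventions (`perceptron_train`, `Phi`, `t` are not from Bishop); it assumes the feature matrix already includes a bias component and that labels are $t_n \in \{+1, -1\}$, so "add $\phi(x_n)$ for $\mathcal{C}_1$, subtract for $\mathcal{C}_2$" collapses to the single update $\mathbf{w} \mathrel{+}= \eta\, t_n \phi_n$:

```python
import numpy as np

def perceptron_train(Phi, t, eta=1.0, max_epochs=100):
    """Cycle through the patterns; for each misclassified one, add
    eta * t_n * phi_n to the weight vector.

    Phi : (N, D) array of feature vectors (bias feature included).
    t   : (N,) array of labels in {+1, -1}.
    """
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for phi_n, t_n in zip(Phi, t):
            # Misclassified iff t_n * w^T phi_n <= 0: exactly the
            # patterns contributing a term to E_P.
            if t_n * (w @ phi_n) <= 0:
                w += eta * t_n * phi_n  # +phi_n for C_1 (t=+1), -phi_n for C_2 (t=-1)
                mistakes += 1
        if mistakes == 0:  # every pattern classified correctly; done
            break
    return w
```

On linearly separable data the loop terminates with a weight vector satisfying $t_n \mathbf{w}^T\phi_n > 0$ for all $n$ (the classical perceptron convergence theorem).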

In stochastic gradient descent, instead of using the gradient of the entire error function, you choose one of the data points $x_n$ and evaluate the gradient of the single summand of the error function corresponding to that data point. For your $E_P$ function, this gradient is $$\begin{cases}0 & n \notin \mathcal{M} \\ -\phi_n t_n & n \in \mathcal{M}\end{cases}$$

So the update is essentially $\mathbf{w}^{(\tau + 1)} \gets \mathbf{w}^{(\tau)} + \eta\, \phi_n t_n\, \mathbf{1}_{n \in \mathcal{M}}$ (note the plus sign: the step is the *negative* of the gradient above), which matches the plain-English description quoted earlier.
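As a quick numerical check of that single-step update (a sketch; `sgd_step` and the specific numbers are my own, not from Bishop), note that for a misclassified point the margin $t_n \mathbf{w}^T\phi_n$ increases by exactly $\eta\,\|\phi_n\|^2$:

```python
import numpy as np

def sgd_step(w, phi_n, t_n, eta=0.5):
    """One stochastic-gradient step for a single pattern."""
    misclassified = t_n * (w @ phi_n) <= 0        # indicator 1_{n in M}
    return w + eta * t_n * phi_n * misclassified  # step = -eta * gradient

w = np.array([1.0, -1.0])
phi_n = np.array([1.0, 2.0])
t_n = 1.0                     # a C_1 pattern; here t_n * w^T phi_n = -1, so misclassified
w_new = sgd_step(w, phi_n, t_n)
# New margin: t_n * w_new^T phi_n = old margin + eta * ||phi_n||^2 = -1 + 0.5*5 = 1.5
```

A correctly classified point leaves $\mathbf{w}$ unchanged, since the indicator is zero.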

Equation 4.55 is misleading in two ways: (1) it writes the update as if the full gradient $\nabla E_P$ were the step, when the algorithm actually uses the stochastic gradient of a single summand, and (2) the final expression $\mathbf{w}^{(\tau)} + \eta\phi_n t_n$ only applies when $n \in \mathcal{M}$.