I am studying Machine Learning, but I believe you guys should be able to help me with this!
Basically, we are given a set of training data $\{(x_1,y_1), (x_2,y_2), \ldots, (x_n, y_n)\}$, and we need to train a perceptron to fit the data as well as possible. A perceptron here is a simple model consisting of a weight vector $w$; the perceptron outputs $w^Tx$ for an input $x$.
We define an error function $E(w) = \frac{1}{2N} \sum_{d=1}^{N}(t_d - o_d)^2$, where $t_d - o_d$ is simply the difference between the ideal target value $t_d$ and our output $o_d = w^Tx_d$.
A way to minimize the error is to compute the gradient, and we obtain
$\frac{\partial E}{\partial w_i} = \frac{1}{N} \sum_{d=1}^{N}(t_d - o_d)(-x_{id})$
Now, in a computer algorithm, I can, on each iteration, update each $w_i$ to $w_i - \eta \frac{\partial E}{\partial w_i}$, but the problem is that computing this is slow, as the sum ranges over the whole training set.
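To make the batch update concrete, here is a minimal sketch of one full-batch gradient-descent step for the setup above. The function name, the toy data, and the learning rate are my own illustrative choices, not anything from a specific library:

```python
import numpy as np

def batch_gd_step(w, X, t, eta=0.1):
    """One batch update: w <- w - eta * dE/dw, where
    dE/dw = (1/N) * sum_d (t_d - o_d) * (-x_d)."""
    o = X @ w                      # outputs o_d = w^T x_d for every example
    grad = -(t - o) @ X / len(t)   # full-batch gradient of E(w)
    return w - eta * grad

# Hypothetical toy data: targets consistent with w* = [1, 2]
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
t = np.array([1.0, 2.0, 3.0])
w = np.zeros(2)
for _ in range(500):
    w = batch_gd_step(w, X, t)
```

Note that every call to `batch_gd_step` touches all $N$ examples, which is exactly the cost the question is complaining about.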
So, what has been invented is the LMS (Least Mean Squares) rule, which claims that
$\frac{1}{N} \sum_{d=1}^{N}(t_d - o_d)(-x_{id}) \approx (t_d - o_d)(-x_{id})$
which means I can just use the current training example to perform my gradient descent.
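The single-example version of the update above can be sketched as follows; this is the stochastic (LMS-style) counterpart of the batch step, with my own hypothetical toy data and learning rate:

```python
import numpy as np

def lms_step(w, x_d, t_d, eta=0.1):
    """One LMS / stochastic update using a single example d:
    w <- w + eta * (t_d - o_d) * x_d."""
    o_d = w @ x_d
    return w + eta * (t_d - o_d) * x_d

# Hypothetical toy data: targets consistent with w* = [1, 2]
rng = np.random.default_rng(0)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
t = np.array([1.0, 2.0, 3.0])
w = np.zeros(2)
for _ in range(2000):
    d = rng.integers(len(t))   # pick one training example at random
    w = lms_step(w, X[d], t[d])
```

Each step now costs $O(\text{dim}(w))$ instead of $O(N \cdot \text{dim}(w))$, which is the whole point of the approximation.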
Now, after this intro, I would like to ask for a bit more intuition and formality behind this LMS rule, and why it is a good enough approximation. I guess I would like a bit more explanation of the $\approx$ part of the above equation: when, and to what extent, does it hold? Thanks for any help.
I learned that the perceptron works on one example at a time, and that the update is not based on gradient descent but simply on the error: if you guess the example's label correctly, you do nothing; if you guess incorrectly, you move the weights paired with the wrong label down by one step and the weights paired with the correct label up by one step, where one step is your step size.
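For the binary case, the mistake-driven update described above can be sketched like this (labels $y \in \{-1, +1\}$; the data and function name are hypothetical):

```python
import numpy as np

def perceptron_epoch(w, X, y, eta=1.0):
    """One pass of the classic mistake-driven perceptron rule:
    update w only when the current guess sign(w^T x) is wrong."""
    for x_d, y_d in zip(X, y):
        if np.sign(w @ x_d) != y_d:   # wrong guess -> move w toward y_d * x_d
            w = w + eta * y_d * x_d   # correct guess -> do nothing
    return w

# Hypothetical linearly separable toy data
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.zeros(2)
for _ in range(10):
    w = perceptron_epoch(w, X, y)
```

Unlike the LMS rule in the question, this update ignores the magnitude of the error entirely; only the sign of the guess matters.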
What you just described sounds like what I learned as logistic regression.
Here is the literature that I learned it from.
http://www.cc.gatech.edu/~jeisenst/classes/cs7650_sp12/lec2.pdf
I have code for the perceptron in Python if that would be useful, but it's pretty ugly and undocumented. There is probably better open-source code around.
However, all of this could depend on the semantics of your class, so take it with a grain of salt.
PS. For a formal reason why it's not that bad to assume the summation is approximately equal to any given sample, you should probably look at statistics and the law of large numbers: the more samples you take from a distribution, the greater the probability that the samples represent the original distribution well. If this route interests you, I can post a paper where Bayes models are trained with Monte Carlo techniques.
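To make the law-of-large-numbers point concrete: the mean of $k$ randomly sampled per-example gradients approaches the full-batch gradient as $k$ grows. A small sketch on hypothetical synthetic data (all names and sizes are my own choices):

```python
import numpy as np

# Hypothetical synthetic regression data
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=10_000)
w = np.zeros(3)

# Full-batch gradient of E(w), averaged over all N examples
full_grad = -(t - X @ w) @ X / len(t)

def sampled_grad(k):
    """Mean of k per-example gradients, drawn uniformly at random."""
    idx = rng.integers(len(t), size=k)
    return -(t[idx] - X[idx] @ w) @ X[idx] / k

err_small = np.linalg.norm(sampled_grad(10) - full_grad)
err_large = np.linalg.norm(sampled_grad(5_000) - full_grad)
```

The error of the sampled estimate shrinks roughly like $1/\sqrt{k}$, which is the standard variance argument behind treating a single-example gradient as a noisy but unbiased stand-in for the full sum.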