Coefficient for the gradient term in stochastic gradient descent (SGD) with momentum


I'm studying SGD with momentum and have come across two versions of the update formula.

The first is from a wiki, and matches the original paper:

$$ \Delta w^t = \alpha \, \Delta w^{t-1} - lr \, \nabla L(w^{t-1}) \\ w^t = w^{t-1} + \Delta w^t $$ where $w$ denotes the weights, $lr$ the learning rate, and $\nabla L(w)$ the gradient of the loss function.

The second version is more common:

$$ \Delta w^t = \alpha \, \Delta w^{t-1} - (1-\alpha) \, \nabla L(w^{t-1}) \\ w^t = w^{t-1} + lr \, \Delta w^t $$
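For context, the two versions differ only by a rescaling of the momentum buffer: multiplying the second version's $\Delta w$ by $lr$ turns it into the first version with an effective learning rate of $lr \cdot (1-\alpha)$. A minimal sketch (plain Python, with a toy 1-D quadratic loss $L(w) = \tfrac{1}{2} w^2$ chosen as an assumption for illustration) showing the two trajectories coincide under that rescaling:

```python
# Toy 1-D loss L(w) = 0.5 * w^2, so grad L(w) = w (illustrative assumption).
def grad(w):
    return w

alpha = 0.9   # momentum coefficient
lr = 0.1      # learning rate used by version 2
steps = 50

# Version 1 uses lr1 = lr * (1 - alpha): substituting d1 = lr * d2 into
# version 2's update shows the two recursions are then identical.
lr1 = lr * (1 - alpha)

w1, d1 = 1.0, 0.0   # version 1 (paper form): w += d1
w2, d2 = 1.0, 0.0   # version 2 (EMA form):   w += lr * d2

for _ in range(steps):
    d1 = alpha * d1 - lr1 * grad(w1)
    w1 = w1 + d1
    d2 = alpha * d2 - (1 - alpha) * grad(w2)
    w2 = w2 + lr * d2

print(abs(w1 - w2))  # the two iterates agree up to floating-point error
```

So the $(1-\alpha)$ factor does not add expressive power; it reparametrizes the step size.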

My question is: why must the coefficient of the $\nabla L(w^{t-1})$ term be $(1-\alpha)$? It seems to me that the update would still make sense even if the coefficient were some other value with $lr \neq (1-\alpha)$.
Is there a specific reason for this choice of coefficient?

This question was originally posted on StackOverflow.