Coefficient for the gradient term in stochastic gradient descent (SGD) with momentum


I'm studying SGD with momentum and have come across two versions of the update formula.

The first is from a wiki, and matches the original paper:

$$ \Delta w^t = \alpha \, \Delta w^{t-1} - lr \, \nabla L(w^{t-1}) \\ w^t = w^{t-1} + \Delta w^t $$ where $w$ denotes the weights, $lr$ the learning rate, and $\nabla L(w)$ the gradient of the loss function.

The second version is more common:

$$ \Delta w^t = \alpha \, \Delta w^{t-1} - (1-\alpha) \, \nabla L(w^{t-1}) \\ w^t = w^{t-1} + lr \, \Delta w^t $$
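For context, the two versions differ only by a rescaling of the momentum buffer: multiplying the second version's $\Delta w$ by $lr$ turns it into the first version with an effective learning rate of $lr \cdot (1-\alpha)$. A minimal sketch (plain Python, with a toy 1-D quadratic loss $L(w) = \tfrac{1}{2} w^2$ chosen as an assumption for illustration) showing the two trajectories coincide under that rescaling:

```python
# Toy 1-D loss L(w) = 0.5 * w^2, so grad L(w) = w (illustrative assumption).
def grad(w):
    return w

alpha = 0.9   # momentum coefficient
lr = 0.1      # learning rate used by version 2
steps = 50

# Version 1 uses lr1 = lr * (1 - alpha): substituting d1 = lr * d2 into
# version 2's update shows the two recursions are then identical.
lr1 = lr * (1 - alpha)

w1, d1 = 1.0, 0.0   # version 1 (paper form): w += d1
w2, d2 = 1.0, 0.0   # version 2 (EMA form):   w += lr * d2

for _ in range(steps):
    d1 = alpha * d1 - lr1 * grad(w1)
    w1 = w1 + d1
    d2 = alpha * d2 - (1 - alpha) * grad(w2)
    w2 = w2 + lr * d2

print(abs(w1 - w2))  # the two iterates agree up to floating-point error
```

So the $(1-\alpha)$ factor does not add expressive power; it reparametrizes the step size.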

My question is: why must the coefficient of the $\nabla L(w^{t-1})$ term be $(1-\alpha)$? It seems to me that the update would still make sense even if the coefficient were some other value with $lr \neq (1-\alpha)$.
Is there a specific reason for this choice of coefficient?

This question was originally posted on StackOverflow.