I'm studying SGD with momentum and have come across two versions of the update formula.
The first is from a wiki and matches the original paper:
$$ \Delta w^t = \alpha * \Delta w^{t-1} - lr * \nabla L(w^{t-1}) \\ w^t = w^{t-1} + \Delta w^t $$ where $w$ denotes the weights, $lr$ the learning rate, and $\nabla L(w)$ the gradient of the loss function.
The second version is more common:
$$ \Delta w^t = \alpha * \Delta w^{t-1} - (1-\alpha) * \nabla L(w^{t-1}) \\ w^t = w^{t-1} + lr*\Delta w^t $$
My question is: why must the coefficient for the $\nabla L(w)$ term be $(1- \alpha)$?
It seems to me that the update would still make sense even with $lr \neq (1-\alpha)$, as in the first version.
Is there any specific reason to choose this coefficient?
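For what it's worth, a quick numerical check (my own toy example, using a quadratic loss $L(w) = \frac{1}{2}\|w\|^2$ so that $\nabla L(w) = w$) suggests the two versions produce identical trajectories when the first version's learning rate is set to $lr \cdot (1-\alpha)$:

```python
import numpy as np

def grad(w):
    # gradient of the toy loss L(w) = 0.5 * ||w||^2
    return w

alpha = 0.9
lr2 = 0.1                  # learning rate for the second version
lr1 = lr2 * (1 - alpha)    # rescaled rate for the first version

w1 = np.array([1.0, -2.0])
d1 = np.zeros_like(w1)
w2 = w1.copy()
d2 = np.zeros_like(w2)

for _ in range(50):
    # Version 1: delta = alpha*delta - lr1*grad;  w += delta
    d1 = alpha * d1 - lr1 * grad(w1)
    w1 = w1 + d1
    # Version 2: delta = alpha*delta - (1-alpha)*grad;  w += lr2*delta
    d2 = alpha * d2 - (1 - alpha) * grad(w2)
    w2 = w2 + lr2 * d2

print(np.allclose(w1, w2))  # True: the trajectories coincide
```

So the $(1-\alpha)$ factor seems to only rescale the effective step size, which makes me wonder all the more why it is the conventional choice.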
This question was originally posted on StackOverflow.