The mathematics behind Adam Optimizer


I have read the paper "ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION".

The PDF link is below: https://arxiv.org/pdf/1412.6980.pdf

Section 2.1 gives the explanation and the intuition behind ADAM, but the statements do not make sense to me.

First, it claims that $| \Delta_t | \le \alpha \cdot (1-\beta_1) / \sqrt{1-\beta_2}$ if $(1-\beta_1) > \sqrt{1-\beta_2}$.

With the suggested values $\beta_1=0.9$ and $\beta_2=0.999$, we have $1-\beta_1 = 0.1 > \sqrt{1-\beta_2} \approx 0.0316$, so this case applies.

Later it claims that $| \Delta_t | \le \alpha$ in "common scenarios", but I think that if $\beta_1$ and $\beta_2$ are fixed, we would never be in these so-called "common scenarios".
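For concreteness, here is a quick check of the condition with the suggested defaults (my own snippet, not from the paper):

```python
import math

beta1, beta2 = 0.9, 0.999  # the paper's suggested defaults

lhs = 1 - beta1             # ≈ 0.1
rhs = math.sqrt(1 - beta2)  # ≈ 0.0316
print(lhs > rhs)            # True: the first (sparse-case) bound applies
```

So with these fixed defaults the condition $(1-\beta_1) > \sqrt{1-\beta_2}$ always holds, which is exactly why the "common scenarios" wording confuses me.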

Did I miss something? I am really confused.


There is 1 answer below.


My understanding is that those two expressions are two possible upper bounds, each applicable in a different situation.

According to the paper (https://arxiv.org/pdf/1412.6980.pdf):

The first case only happens in the most severe case of sparsity: when a gradient has been zero at all timesteps except at the current timestep.

In that situation the upper bound will be:

$| \Delta_t | \le \alpha \cdot (1-\beta_1) / \sqrt{1-\beta_2} $

Otherwise the upper bound will be:

$|\Delta_t | \le \alpha$
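To see both regimes numerically, here is a small sketch (my own, not from the paper) that runs bias-corrected Adam on a scalar gradient sequence and reports the final effective step size $|\Delta_t| = \alpha \cdot |\hat{m}_t| / \sqrt{\hat{v}_t}$, with $\epsilon$ omitted as in the paper's step-size analysis:

```python
import math

def adam_step_size(grads, alpha=1.0, beta1=0.9, beta2=0.999):
    """Run bias-corrected Adam on a 1-D gradient sequence (epsilon omitted)
    and return the last effective step |Delta_t| = alpha*|m_hat|/sqrt(v_hat)."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
    return alpha * abs(m_hat) / math.sqrt(v_hat)

alpha, beta1, beta2 = 1.0, 0.9, 0.999
sparse_bound = alpha * (1 - beta1) / math.sqrt(1 - beta2)  # ≈ 3.162

# Most severe sparsity: gradient is zero everywhere except the final step.
sparse = adam_step_size([0.0] * 9_999 + [5.0], alpha, beta1, beta2)

# A "common scenario": the gradient keeps the same sign and magnitude.
dense = adam_step_size([5.0] * 10_000, alpha, beta1, beta2)

print(sparse, sparse_bound)  # sparse step approaches (but stays below) the larger bound
print(dense)                 # dense step is essentially alpha
```

In the sparse run the step comes out close to $\alpha(1-\beta_1)/\sqrt{1-\beta_2} \approx 3.16\,\alpha$, while in the constant-gradient run $\hat{m}_t \approx g$ and $\hat{v}_t \approx g^2$, so the step is essentially exactly $\alpha$. So both bounds are realized, each in its own regime.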