Advantage of Exponentiated Gradient over Gradient Descent


Suppose I am trying to fit a function $y(x) = \sum_{n} c_n f_n(x)$, where the functions $f_n(x)$ are "expert predictions". At each time step I receive one evaluated value $y(x_t)$, which I use to update my weights $c_n$. It can be shown (Kivinen and Warmuth, 1997) that this can be done in a manner whose error, up to a sublinear additive term, matches that obtained via standard regression.

Two ways to do this are:

  1. Gradient Descent: $c_n \leftarrow c_n +2\eta\left[ y(x_t) - \sum_n c_n f_n(x_t) \right]f_n(x_t)$,

  2. Exponentiated Gradient: $c_n^\pm \leftarrow c_n^\pm\exp\left(\pm 2\eta\left[ y(x_t) - \sum_n c_n f_n(x_t) \right]f_n(x_t)\right)$, then normalise so that $\sum_{n} (c_n^+ + c_n^-) = 1$.

In the second case, we then take $c_n = (c_n^+ - c_n^-)$.
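For concreteness, the two updates can be sketched as below. This is a minimal sketch, not code from the paper: the names `gd_step` and `eg_step`, the learning rate value, and the toy data are my own choices, and the signs are chosen so that both updates move the prediction $\sum_n c_n f_n(x_t)$ toward $y(x_t)$.

```python
import numpy as np

def gd_step(c, f, y, eta):
    """One gradient-descent update of the weights c on features f and target y."""
    resid = y - c @ f  # y(x_t) - sum_n c_n f_n(x_t)
    return c + 2 * eta * resid * f

def eg_step(c_plus, c_minus, f, y, eta):
    """One EG+- update: multiplicative weight update, then normalise total mass to 1."""
    resid = y - (c_plus - c_minus) @ f  # effective weights are c = c+ - c-
    new_plus = c_plus * np.exp(2 * eta * resid * f)
    new_minus = c_minus * np.exp(-2 * eta * resid * f)
    Z = new_plus.sum() + new_minus.sum()  # enforce sum(c+ + c-) = 1
    return new_plus / Z, new_minus / Z

# Toy usage (hypothetical data): learn y = 0.7*f0 - 0.2*f1 online.
rng = np.random.default_rng(0)
c = np.zeros(2)
cp = np.full(2, 0.25)
cm = np.full(2, 0.25)
for _ in range(200):
    f = rng.normal(size=2)
    y = 0.7 * f[0] - 0.2 * f[1]
    c = gd_step(c, f, y, 0.05)
    cp, cm = eg_step(cp, cm, f, y, 0.05)
# Both c and (cp - cm) approach [0.7, -0.2]; note the target is representable
# under EG's constraint because |0.7| + |-0.2| <= 1.
```

Note that EG± with total weight normalised to 1 can only represent weight vectors with $\sum_n |c_n| \le 1$; scaling the total weight (the parameter $U$ in Kivinen and Warmuth) relaxes this.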

My question is: when is it appropriate to choose one approach over the other? The derivation of the second method minimises the relative entropy between successive weight vectors, so I understand why it might be useful when the $c_n$ must form a probability distribution. Is that the only situation in which one might prefer the second approach?
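For reference, here is a sketch of how the relative-entropy view yields the multiplicative update (the standard mirror-descent argument; this omits the regret analysis). Each step solves, over normalised weights,

$$c_{t+1} = \arg\min_{c}\; D(c \,\|\, c_t) + \eta\, \nabla L_t(c_t)^\top (c - c_t),$$

where $D(c\|c_t) = \sum_n c_n \ln(c_n/c_{t,n})$ and $L_t(c) = \left[y(x_t) - \sum_n c_n f_n(x_t)\right]^2$. Setting the derivative with respect to $c_n$ to zero (with a Lagrange multiplier for the normalisation constraint) gives

$$c_{t+1,n} \propto c_{t,n}\, e^{-\eta\, \partial L_t/\partial c_n}, \qquad \frac{\partial L_t}{\partial c_n} = -2\left[y(x_t) - \sum_m c_m f_m(x_t)\right] f_n(x_t),$$

which is exactly the exponentiated update; replacing the relative entropy with the squared Euclidean distance $\tfrac{1}{2}\|c - c_t\|^2$ in the same problem recovers gradient descent instead.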

Thank you in advance!