RMSprop is a method for damping the oscillations that can occur during learning. For example, see the figure below (taken from Andrew Ng's Coursera lecture):
In this figure, the global optimum resides at the red dot, and we have two parameters, W and b, to tune. In this case, learning oscillates along b, which forces slow learning along W, the direction we actually need to move in to reach the goal.
RMSprop addresses this oscillation problem by, as I understand it, normalizing the gradients:
$S_{dW} = \beta S_{dW} + (1 - \beta)dW^2$
$S_{db} = \beta S_{db} + (1 - \beta)db^2$
$W = W - \alpha\frac{dW}{\sqrt{S_{dW}}}$
$b = b - \alpha\frac{db}{\sqrt{S_{db}}}$
The intuition here is that $S_{dW}$ and $S_{db}$ hold exponentially weighted averages of the squared gradients $dW^2$ and $db^2$ over time, and we use those values to scale $dW$ and $db$ up or down when updating $W$ and $b$. In this example, we want to reduce $db$, since it causes the oscillation; we achieve this by dividing $db$ by $\sqrt{S_{db}}$, which is a large value relative to $db$ (the square root because we squared $db$). Similarly, we want to increase $dW$, since it moves us closer to the goal; this is achieved by dividing $dW$ by $\sqrt{S_{dW}}$, which is a small value relative to $dW$. So far so good.
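In code, the update rule looks something like this (a minimal NumPy sketch; I have added the small `eps` term that is commonly used in practice to avoid division by zero, which the formulas above omit):

```python
import numpy as np

def rmsprop_step(params, grads, cache, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update.

    cache holds the running averages S of the squared gradients.
    eps guards against division by zero (standard in practice).
    """
    new_params, new_cache = {}, {}
    for k in params:
        # S = beta * S + (1 - beta) * grad^2
        new_cache[k] = beta * cache[k] + (1 - beta) * grads[k] ** 2
        # param = param - lr * grad / sqrt(S)
        new_params[k] = params[k] - lr * grads[k] / (np.sqrt(new_cache[k]) + eps)
    return new_params, new_cache

# Hypothetical gradients: small on W, large (oscillating) on b.
params = {"W": 0.0, "b": 0.0}
cache = {"W": 0.0, "b": 0.0}
params, cache = rmsprop_step(params, {"W": 0.5, "b": 8.0}, cache)
```

Note that on the very first step, $S = (1-\beta)\,g^2$, so $g/\sqrt{S} = \pm 1/\sqrt{1-\beta}$ regardless of the gradient's size: the effective step on $W$ and $b$ has the same magnitude.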
However, what if we had a situation where we did not have oscillation? Let us assume the same figure as above, but now without any oscillation on $b$ (meaning $db$ is close to, but not exactly, 0) and with fast learning on $W$ (meaning $dW$ is relatively large). In this situation the values of $db$ and $dW$ are already good, but wouldn't RMSprop then increase $db$ (since $\sqrt{S_{db}}$ becomes a relatively small value) and reduce $dW$ (since $\sqrt{S_{dW}}$ becomes a relatively large value), thereby inadvertently introducing oscillation? Or have I missed something?
It feels like RMSprop is making the assumption that very small or very large derivatives are always bad, whereas reasonably sized derivatives are always good. Is my thinking correct here?
