Introduction
Here are some high-level intuitions that seem to be folklore in the optimization community:
- The gradient descent method is often motivated from a physical point of view, as a 'ball rolling down a hill', or something to that effect.
- This is a good high-level analogy, but once you look closer at the details of the algorithm, this point of view doesn't stand up to scrutiny. For example, while the physical picture suggests that the ball accelerates down the hill, no such thing happens in vanilla gradient descent; it is more like Aristotelian physics, in that the 'force' produces a velocity rather than an acceleration.
- Still, there are more sophisticated variants on the theme of gradient descent which exhibit such acceleration, such as gradient descent with momentum, or damped Newton's method, and some of these are governed by differential equations resembling actual physical scenarios.
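To make the contrast in the list above concrete, here is a minimal Python sketch comparing vanilla gradient descent with the heavy-ball form of momentum. The potential $V(x) = x^2/2$, the step size, and the momentum coefficient `beta` are illustrative assumptions, not values taken from the question.

```python
def grad(x):
    # Gradient of the hypothetical potential V(x) = 0.5 * x**2.
    return x


def gradient_descent(x0, step, n_steps):
    # First-order ("Aristotelian") update: the displacement at each step
    # depends only on the current gradient, with no memory of past motion.
    xs = [x0]
    x = x0
    for _ in range(n_steps):
        x = x - step * grad(x)
        xs.append(x)
    return xs


def heavy_ball(x0, step, beta, n_steps):
    # Second-order ("Newtonian") update: a velocity v is carried between
    # steps, and the gradient changes the velocity rather than the position.
    xs = [x0]
    x, v = x0, 0.0
    for _ in range(n_steps):
        v = beta * v - step * grad(x)
        x = x + v
        xs.append(x)
    return xs


gd_path = gradient_descent(1.0, 0.1, 50)
hb_path = heavy_ball(1.0, 0.1, 0.9, 50)
```

With these particular assumed parameters, the gradient descent iterates shrink monotonically toward the minimum, while the heavy-ball iterates overshoot it and oscillate, much as a ball with inertia would.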
My question: has anyone written down a systematic study of physical interpretations of gradient descent and its variants? More precisely, are there ways to cast such algorithms in a physical setting (even if the physics is 'fake' in the sense that it doesn't quite match real-world physics), and are there interesting quantities (e.g. some form of energy) naturally associated with these interpretations?
This week I have been trying to understand gradient descent in the light of physics. As a matter of mathematical formalism, consider the following assumptions:
With the above assumptions, let's prove that our energy functional $E$ is conserved if and only if $F(\mathbf{x},t) = -\nabla V(\mathbf{x},t)$.
Differentiating $E$ with respect to $t$ yields:
\begin{align} \dfrac{d}{dt}\left(\dfrac{1}{2}m|\dot{\mathbf{x}}|^{2}+V(\mathbf{x}(t))\right) &= m\sum_{j} \dot{x}_{j}\ddot{x}_{j} + \sum_{j} \dfrac{\partial V}{\partial x_{j}}\dot{x}_{j}\\ &= \langle\dot{\mathbf{x}}(t), m\ddot{\mathbf{x}}(t)+\nabla V\rangle\\ &= \langle\dot{\mathbf{x}}(t), F(\mathbf{x},t)+\nabla V\rangle \end{align}
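The last equality uses Newton's second law, $m\ddot{\mathbf{x}} = F$. The result vanishes for every velocity $\dot{\mathbf{x}}$ if and only if $F(\mathbf{x},t) = -\nabla V$.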
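The conservation statement can be checked numerically. The following is a rough Python sketch, assuming a one-dimensional potential $V(x) = x^2/2$, unit mass, and a velocity-Verlet-style integrator; the names `simulate`, `potential`, and `grad_potential`, and the friction coefficient 0.5, are all hypothetical choices made for illustration.

```python
def potential(x):
    # Hypothetical potential V(x) = 0.5 * x**2.
    return 0.5 * x * x


def grad_potential(x):
    # Gradient of the hypothetical potential above.
    return x


def energy(x, v, mass):
    # E = kinetic energy + potential energy.
    return 0.5 * mass * v * v + potential(x)


def simulate(force, x0, v0, mass=1.0, dt=0.01, n_steps=1000):
    # Integrate m * x'' = force(x, v) with a velocity-Verlet-style scheme
    # and record the energy after every step.
    x, v = x0, v0
    energies = [energy(x, v, mass)]
    for _ in range(n_steps):
        a = force(x, v) / mass
        x_new = x + v * dt + 0.5 * a * dt * dt
        v_half = v + 0.5 * a * dt
        a_new = force(x_new, v_half) / mass
        v = v_half + 0.5 * a_new * dt
        x = x_new
        energies.append(energy(x, v, mass))
    return energies


# Force equal to -grad V: energy should stay constant up to integration error.
conservative = simulate(lambda x, v: -grad_potential(x), 1.0, 0.0)
drift = max(abs(e - conservative[0]) for e in conservative)

# Force with an extra friction term (not of the form -grad V): energy decays.
damped = simulate(lambda x, v: -grad_potential(x) - 0.5 * v, 1.0, 0.0)
```

In this sketch `drift` stays small for the purely conservative force, while the last entry of `damped` is far below its initial value, which is consistent with the derivation: adding a force component that is not $-\nabla V$ makes $dE/dt$ nonzero.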
Therefore, by our third assumption, a particle in a conservative field with no external forces applied to it will follow the direction of its conservative force, $F(\mathbf{x},t) = -\nabla V(\mathbf{x},t)$, which is exactly the path that minimizes our energy functional. Thus:
$$\mathbf{x}(t_{0}+t) = \mathbf{x}(t_{0}) + \gamma F(\mathbf{x},t_{0})$$
where $\gamma$ is a constant whose units ensure that we are adding meters to meters. This is a continuous formulation of gradient descent. We can thus draw the following parallel: