Convergence Analysis of Linear Interpolant Process in Stochastic Gradient Descent with ODE Method


Consider the ODE method for stochastic gradient descent. Let $Q$ be a probability measure on $\mathbb{R}^d$ (the data distribution) and let $L: \mathbb{R}^p \times \mathbb{R}^d \rightarrow \mathbb{R}$ be the loss function, where $\mathbb{R}^p$ is the parameter space and $\mathbb{R}^d$ is the data space. Stochastic gradient descent (SGD) with learning rate $\delta>0$ is defined as follows: starting from the initial point $X_0^\delta = x \in \mathbb{R}^p$, we draw $(Y_n)_{n \ge 1}$ i.i.d. from $Q$ and iterate $$ X_n^\delta = X_{n-1}^\delta - \delta \nabla L\left(X_{n-1}^\delta, Y_n\right). $$

We define the population gradient flow as the solution of the ODE $$ \left\{\begin{array}{l} \dot{Y}_t=-\nabla \Phi(Y_t) \\ Y_0=x \end{array}\right. $$ where $\Phi(x)=\mathbb{E}_{Y \sim Q}\, L(x, Y)$. Under what conditions on $Q$ and $L$ can we prove convergence of the linearly interpolated SGD process to this flow, in the sense that $Y_t^\delta \approx X_{t/\delta}^\delta$ as $\delta \to 0$? (One should define the linear interpolant precisely, and likely introduce assumptions ensuring the population gradient flow is well-posed, e.g. that the associated 'martingale' problem is well-posed in the limit...)
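As a numerical sanity check, here is a minimal sketch of the phenomenon in question, assuming a hypothetical one-dimensional instance not taken from the question: quadratic loss $L(x,y) = \tfrac12(x-y)^2$ with Gaussian data $Q = N(\mu, \sigma^2)$, for which $\Phi(x) = \tfrac12\big((x-\mu)^2 + \sigma^2\big)$ and the gradient flow has the explicit solution $Y_t = \mu + (x_0 - \mu)e^{-t}$. The SGD path evaluated at times $t = n\delta$ should track this solution, with sup-norm deviation shrinking as $\delta \to 0$.

```python
import numpy as np

# Hypothetical concrete instance (not from the question):
# L(x, y) = (x - y)^2 / 2 with Q = N(MU, SIGMA^2), so
# Phi(x) = ((x - MU)^2 + SIGMA^2) / 2 and the flow solves Y_t' = -(Y_t - MU).
MU, SIGMA = 2.0, 1.0
rng = np.random.default_rng(0)

def sgd_path(x0, delta, T):
    """SGD: X_n = X_{n-1} - delta * grad_x L(X_{n-1}, Y_n) = X_{n-1} - delta*(X_{n-1} - Y_n)."""
    n_steps = int(T / delta)
    xs = np.empty(n_steps + 1)
    xs[0] = x0
    ys = rng.normal(MU, SIGMA, size=n_steps)  # i.i.d. data samples from Q
    for n in range(n_steps):
        xs[n + 1] = xs[n] - delta * (xs[n] - ys[n])
    return xs

def flow(t, x0):
    """Explicit solution of the population gradient flow for this quadratic Phi."""
    return MU + (x0 - MU) * np.exp(-t)

x0, T = 0.0, 5.0
errors = {}
for delta in (0.1, 0.01, 0.001):
    xs = sgd_path(x0, delta, T)
    ts = delta * np.arange(len(xs))  # grid points t = n * delta of the interpolant
    errors[delta] = np.max(np.abs(xs - flow(ts, x0)))
    print(f"delta={delta:g}  sup_t |X_(t/delta) - Y_t| = {errors[delta]:.4f}")
```

Heuristically, the fluctuations of the SGD path around the flow are of order $\sqrt{\delta}$ (a diffusion-approximation scaling), which is what the printed sup-norm errors should exhibit as $\delta$ decreases.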