Convergence Rate and Complexity of Single SGD update for Neural Networks


Setting:


Fix positive integers $m_1,\dots,m_{n+1}$, let the activation function $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ be $C^1$ with Lipschitz derivative, and let $NN$ denote the set of all feed-forward neural networks of the form $$ f_{\theta}(x):= \left(A_n\circ \sigma \circ \dots \circ \sigma \circ A_1\right)(x), $$ where $A_i(x)=M_ix+b_i$ with $M_i$ an $m_{i+1}\times m_{i}$ matrix and $b_i$ a vector in $\mathbb{R}^{m_{i+1}}$, and where $\theta=(M_1,\dots,M_n,b_1,\dots,b_n) \in \mathbb{R}^L$ collects all the parameters.

Note: $\sigma$ is applied component-wise.

Let $\ell:\mathbb{R}^{m_{n+1}}\rightarrow [0,\infty)$ be a $C^1$ function with Lipschitz gradient, let $\{x_i\}_{i=1}^N$ be elements of $\mathbb{R}^{m_1}$, and let $$ L(\theta)\triangleq \frac1{N}\sum_{i=1}^N \ell(f_{\theta}(x_i)) . $$
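
For concreteness, here is a minimal sketch of this setting (assuming NumPy, and taking $\sigma = \tanh$ as one admissible activation with Lipschitz derivative; the function names are illustrative):

```python
import numpy as np

def f_theta(theta, x):
    """Evaluate f_theta(x) = (A_n o sigma o ... o sigma o A_1)(x), where
    theta is a list of (M_i, b_i) pairs and M_i has shape (m_{i+1}, m_i)."""
    h = x
    for i, (M, b) in enumerate(theta):
        h = M @ h + b                  # affine map A_i(h) = M_i h + b_i
        if i < len(theta) - 1:         # sigma is applied component-wise
            h = np.tanh(h)             # after every layer except the last
    return h

def empirical_loss(theta, xs, loss):
    """L(theta) = (1/N) * sum_i loss(f_theta(x_i))."""
    return np.mean([loss(f_theta(theta, x)) for x in xs])

# Example with widths (m_1, m_2, m_3) = (3, 4, 2) and a squared-norm loss.
theta = [(np.random.randn(4, 3), np.random.randn(4)),
         (np.random.randn(2, 4), np.random.randn(2))]
xs = [np.random.randn(3) for _ in range(10)]
print(empirical_loss(theta, xs, lambda y: np.sum(y**2)))
```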


Question:

Suppose we want to optimize $L(\theta)$ over $\mathbb{R}^L$ using (resp. mini-batch) stochastic gradient descent (SGD). Then I have the following questions:

  1. What is the complexity of computing a single gradient update? This is probably a function of the data size, no?

  2. Is the convergence rate of the (resp. mini-batch) SGD a function of $N$ (and/or the batch-size $b$)?

Intuitively, at least one of these two must depend on the data size (and the mini-batch size).

I'm looking for a non-empirical result, and I don't want to assume that $\sigma$ is an affine map (as in the classical MLP).

Best Answer:
  1. Each SGD iteration involves computing $\nabla_{\theta} \ell(f_{\theta}(x_j)) = d\ell(f_{\theta}(x_j)) \nabla_{\theta}f_{\theta}(x_j)$ for some (randomly chosen) index $j \in \{1,\cdots,N\}$. Therefore, the cost of this step equals the cost of evaluating $f_{\theta}(x_j)$ and its gradient. The cost of evaluating $f_{\theta}(x_j)$ is $O(m_1m_2 + m_2m_3 + \cdots + m_{n}m_{n+1})$. To see this, note that evaluating $\sigma \circ A_1(x)$ involves multiplying an $m_2 \times m_1$ matrix with a vector of dimension $m_1$, adding two vectors of dimension $m_2$, and finally applying $\sigma$ to each of the $m_2$ components; the remaining layers are handled the same way. During this computation, we can store the intermediate values $\sigma \circ A_1(x), \sigma \circ A_2(\sigma \circ A_1(x))$, and so on. From automatic differentiation, we know that the cost of the gradient evaluation is a constant factor times this quantity; see Theorem 4.1 in this reference, for example. If the mini-batch size is $B$, then the per-iteration cost of mini-batch SGD is $B$ times this quantity (assuming that gradients and function values are not reused even if duplicate choices of $j$ are present in the mini-batch; even otherwise, the cost is roughly the same on average). The first sketch after this list illustrates this count; in particular, the per-iteration cost depends on the architecture and the batch size, not on $N$.
  2. Let $g_i(\theta) := \nabla_{\theta} \ell(f_{\theta}(x_i))$. Then the full gradient at $\theta$ equals $\frac{1}{N}\sum_{i=1}^{N} g_i(\theta)$. From Corollary 2.2 of this paper, the "level of suboptimality" (measured in terms of the norm of the gradient, since the problem is nonconvex) after iteration $T$ is $O\left(\dfrac{1}{T}\right) + O\left(\dfrac{\sigma}{\sqrt{T}}\right)$, where $\sigma > 0$ is the standard deviation of the stochastic gradient (sorry for the clash in notation). From basic statistics, the variance of $g_j(\theta)$ for $j$ chosen uniformly at random from $\{1,\cdots,N\}$ does not depend on the number of samples $N$: if we use a single random draw from a probability distribution to estimate its mean, the variance of this one-draw estimator does not depend on the size of the population. Therefore, the rate of convergence of the stochastic gradient method does not depend on the sample size $N$. If we use a mini-batch of $B$ i.i.d. draws, then the standard deviation shrinks by a factor of $\sqrt{B}$ (again from basic statistics on the sample mean), so the rate of convergence becomes $O\left(\dfrac{1}{T}\right) + O\left(\dfrac{\sigma}{\sqrt{BT}}\right)$; the second sketch below simulates this shrinkage. Note that if we use a full gradient evaluation, then $\sigma = 0$ and we get the faster rate of convergence $O\left(\dfrac{1}{T}\right)$. Note also that in some cases we don't even have this difference between stochastic and deterministic methods; see the analyses in Sections 3.3 and 3.4 here, for instance.
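
For concreteness, here is a minimal sketch of the cost count in point 1 (plain Python; the helper names and the constant `c_backprop` are illustrative placeholders, not values taken from the cited reference):

```python
def forward_cost(dims):
    """Multiply-add count of one forward pass f_theta(x), where
    dims = [m_1, ..., m_{n+1}] are the layer widths: sum_i m_i * m_{i+1}
    (bias additions and activation evaluations are lower order)."""
    return sum(m_in * m_out for m_in, m_out in zip(dims[:-1], dims[1:]))

def sgd_iteration_cost(dims, batch_size=1, c_backprop=3.0):
    """Rough per-iteration cost of (mini-batch) SGD: batch_size times
    (forward pass + reverse-mode gradient), the gradient costing a constant
    factor times the forward pass (c_backprop = 3.0 is only a placeholder)."""
    return batch_size * (1.0 + c_backprop) * forward_cost(dims)

# Example widths m_1, ..., m_{n+1} and batch size B; note that the cost
# involves the architecture and B, but not the sample size N.
print(sgd_iteration_cost([784, 128, 64, 10], batch_size=32))
```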
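
And here is a small simulation of the variance argument in point 2 (assuming NumPy; scalar placeholders stand in for the per-sample gradients $g_i(\theta)$), illustrating that the estimator's standard deviation scales like $1/\sqrt{B}$ and is essentially independent of $N$:

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_estimator_std(N, B, trials=20_000):
    """Std of the mini-batch estimate (1/B) * sum of B uniform draws (with
    replacement) from a fixed 'population' g_1, ..., g_N; each g_i is a
    scalar standing in for the per-sample gradient of ell(f_theta(x_i))."""
    g = rng.normal(size=N)                      # fixed population of size N
    idx = rng.integers(0, N, size=(trials, B))  # uniformly chosen indices j
    return g[idx].mean(axis=1).std()            # spread of the mini-batch means

for N in (100, 10_000):
    for B in (1, 16, 64):
        print(f"N={N:6d}  B={B:3d}  std ~ {minibatch_estimator_std(N, B):.3f}")
# Output pattern: std is roughly 1/sqrt(B) for both values of N, i.e. the
# noise level shrinks with the batch size B but does not depend on N.
```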