Convergence Rate and Complexity of Single SGD update for Neural Networks


Setting:


Fix positive integers $m_1,\dots,m_{n+1}$, let the activation function $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ be $C^1$ with Lipschitz derivative, and let $NN$ denote the set of all feed-forward neural networks of the form $$ f_{\theta}(x):= \left(A_n\circ \sigma \circ \dots \circ \sigma \circ A_1\right)(x), $$ where $A_i(x)=M_ix+b_i$ with $M_i$ an $m_{i+1}\times m_{i}$ matrix and $b_i$ a vector in $\mathbb{R}^{m_{i+1}}$, and where $\theta=(M_1,\dots,M_n,b_1,\dots,b_n) \in \mathbb{R}^L$ collects all the parameters.

Note: $\sigma$ is applied component-wise.

Let $\ell:\mathbb{R}^{m_{n+1}}\rightarrow [0,\infty)$ be a $C^1$ function with Lipschitz gradient, let $\{x_i\}_{i=1}^N$ be elements of $\mathbb{R}^{m_1}$, and let $$ L(\theta)\triangleq \frac1{N}\sum_{i=1}^N \ell(f_{\theta}(x_i)) . $$
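
For concreteness, here is a minimal sketch of this setting (assuming NumPy, and taking $\sigma = \tanh$ as one admissible activation with Lipschitz derivative; the function names are illustrative):

```python
import numpy as np

def f_theta(theta, x):
    """Evaluate f_theta(x) = (A_n o sigma o ... o sigma o A_1)(x), where
    theta is a list of (M_i, b_i) pairs and M_i has shape (m_{i+1}, m_i)."""
    h = x
    for i, (M, b) in enumerate(theta):
        h = M @ h + b                  # affine map A_i(h) = M_i h + b_i
        if i < len(theta) - 1:         # sigma is applied component-wise
            h = np.tanh(h)             # after every layer except the last
    return h

def empirical_loss(theta, xs, loss):
    """L(theta) = (1/N) * sum_i loss(f_theta(x_i))."""
    return np.mean([loss(f_theta(theta, x)) for x in xs])

# Example with widths (m_1, m_2, m_3) = (3, 4, 2) and a squared-norm loss.
theta = [(np.random.randn(4, 3), np.random.randn(4)),
         (np.random.randn(2, 4), np.random.randn(2))]
xs = [np.random.randn(3) for _ in range(10)]
print(empirical_loss(theta, xs, lambda y: np.sum(y**2)))
```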


Question:

Suppose we want to optimize $L(\theta)$ over $\mathbb{R}^L$ using (resp. mini-batch) stochastic gradient descent (SGD). Then I have the following questions:

  1. What is the complexity of computing a single gradient update? This is probably a function of the data size, no?

  2. Is the convergence rate of the (resp. mini-batch) SGD a function of $N$ (and/or the batch-size $b$)?

Intuitively, at least one of these two must depend on the data size (and the mini-batch size).

I'm looking for a non-empirical result, and I don't want to assume that $\sigma$ is an affine map (as in the classical MLP).

Best Answer:
  1. Each SGD iteration involves computing $\nabla_{\theta} \ell(f_{\theta}(x_j)) = d\ell(f_{\theta}(x_j)) \nabla_{\theta}f_{\theta}(x_j)$ for some (randomly chosen) index $j \in \{1,\cdots,N\}$. Therefore, the cost of this step equals the cost of evaluating $f_{\theta}(x_j)$ and its gradient. The cost of evaluating $f_{\theta}(x_j)$ is $O(m_1m_2 + m_2m_3 + \cdots + m_{n}m_{n+1})$. To see this, note that evaluating $\sigma \circ A_1(x)$ involves multiplying an $m_2 \times m_1$ matrix with a vector of dimension $m_1$, adding two vectors of dimension $m_2$, and finally applying $\sigma$ to each of the $m_2$ components; the remaining layers are handled the same way. During this computation, we can store the intermediate values $\sigma \circ A_1(x), \sigma \circ A_2(\sigma \circ A_1(x))$, and so on. From automatic differentiation, we know that the cost of the gradient evaluation is a constant factor times this quantity; see Theorem 4.1 in this reference, for example. If the mini-batch size is $B$, then the per-iteration cost of mini-batch SGD is $B$ times this quantity (assuming that gradients and function values are not reused even if duplicate choices of $j$ are present in the mini-batch; even otherwise, the cost is roughly the same on average). The first sketch after this list illustrates this count; in particular, the per-iteration cost depends on the architecture and the batch size, not on $N$.
  2. Let $g_i(\theta) := \nabla_{\theta} \ell(f_{\theta}(x_i))$. Then the full gradient at $\theta$ equals $\frac{1}{N}\sum_{i=1}^{N} g_i(\theta)$. From Corollary 2.2 of this paper, the "level of suboptimality" (measured in terms of the norm of the gradient, since the problem is nonconvex) after iteration $T$ is $O\left(\dfrac{1}{T}\right) + O\left(\dfrac{\sigma}{\sqrt{T}}\right)$, where $\sigma > 0$ is the standard deviation of the stochastic gradient (sorry for the clash in notation). From basic statistics, the variance of $g_j(\theta)$ for $j$ chosen uniformly at random from $\{1,\cdots,N\}$ does not depend on the number of samples $N$: if we use a single random draw from a probability distribution to estimate its mean, the variance of this one-draw estimator does not depend on the size of the population. Therefore, the rate of convergence of the stochastic gradient method does not depend on the sample size $N$. If we use a mini-batch of $B$ i.i.d. draws, then the standard deviation shrinks by a factor of $\sqrt{B}$ (again from basic statistics on the sample mean), so the rate of convergence becomes $O\left(\dfrac{1}{T}\right) + O\left(\dfrac{\sigma}{\sqrt{BT}}\right)$; the second sketch below simulates this shrinkage. Note that if we use a full gradient evaluation, then $\sigma = 0$ and we get the faster rate of convergence $O\left(\dfrac{1}{T}\right)$. Note also that in some cases we don't even have this difference between stochastic and deterministic methods; see the analyses in Sections 3.3 and 3.4 here, for instance.
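
For concreteness, here is a minimal sketch of the cost count in point 1 (plain Python; the helper names and the constant `c_backprop` are illustrative placeholders, not values taken from the cited reference):

```python
def forward_cost(dims):
    """Multiply-add count of one forward pass f_theta(x), where
    dims = [m_1, ..., m_{n+1}] are the layer widths: sum_i m_i * m_{i+1}
    (bias additions and activation evaluations are lower order)."""
    return sum(m_in * m_out for m_in, m_out in zip(dims[:-1], dims[1:]))

def sgd_iteration_cost(dims, batch_size=1, c_backprop=3.0):
    """Rough per-iteration cost of (mini-batch) SGD: batch_size times
    (forward pass + reverse-mode gradient), the gradient costing a constant
    factor times the forward pass (c_backprop = 3.0 is only a placeholder)."""
    return batch_size * (1.0 + c_backprop) * forward_cost(dims)

# Example widths m_1, ..., m_{n+1} and batch size B; note that the cost
# involves the architecture and B, but not the sample size N.
print(sgd_iteration_cost([784, 128, 64, 10], batch_size=32))
```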
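
And here is a small simulation of the variance argument in point 2 (assuming NumPy; scalar placeholders stand in for the per-sample gradients $g_i(\theta)$), illustrating that the estimator's standard deviation scales like $1/\sqrt{B}$ and is essentially independent of $N$:

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_estimator_std(N, B, trials=20_000):
    """Std of the mini-batch estimate (1/B) * sum of B uniform draws (with
    replacement) from a fixed 'population' g_1, ..., g_N; each g_i is a
    scalar standing in for the per-sample gradient of ell(f_theta(x_i))."""
    g = rng.normal(size=N)                      # fixed population of size N
    idx = rng.integers(0, N, size=(trials, B))  # uniformly chosen indices j
    return g[idx].mean(axis=1).std()            # spread of the mini-batch means

for N in (100, 10_000):
    for B in (1, 16, 64):
        print(f"N={N:6d}  B={B:3d}  std ~ {minibatch_estimator_std(N, B):.3f}")
# Output pattern: std is roughly 1/sqrt(B) for both values of N, i.e. the
# noise level shrinks with the batch size B but does not depend on N.
```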