Let's say we want to minimise
$$f(x) := \frac{1}{2}\|Ax-b\|_2^2 + \frac{\lambda}{2}\|x\|_2^2$$
where $A\in\mathbb{R}^{D\times N}$, $x\in\mathbb{R}^N$, $b\in \mathbb{R}^D$, and $\lambda\geq 0$.
How can we write $f(x)$ as a sum of auxiliary functions, i.e. $f(x) = \frac{1}{N}\sum_{i=1}^N f_i(x)$?
A gradient descent algorithm would iterate
$x^{(t+1)}\leftarrow x^{(t)}-\eta_t\nabla f(x^{(t)})$
for some step-size $\eta_t$
Calculating
$\nabla f(x) = \left(\frac{\partial}{\partial x}\left(Ax-b\right)\right)^{\top}\left(Ax-b\right) + \frac{\lambda}{2}\cdot 2x$
$=A^\top \left( Ax -b\right) + \lambda x$
isn't a problem, but each full-gradient step touches all of $A$ and therefore costs $\mathcal{O}(DN)$, which is slow for large problems.
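As a quick numerical sanity check of this closed-form gradient (a sketch of my own; the variable names and the small random instance are made up), one can compare it against central finite differences:

```python
import numpy as np

# Compare grad f(x) = A.T (Ax - b) + lam x against central finite
# differences on a small random instance.
rng = np.random.default_rng(0)
D, N, lam = 5, 3, 0.1
A = rng.standard_normal((D, N))
b = rng.standard_normal(D)
x = rng.standard_normal(N)

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + 0.5 * lam * np.sum(x ** 2)

grad = A.T @ (A @ x - b) + lam * x

# central finite differences along each coordinate direction
eps = 1e-6
num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                for e in np.eye(N)])
assert np.allclose(grad, num, atol=1e-5)
```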
Stochastic gradient descent (SGD) instead assumes a decomposition
$f(x) = \frac{1}{N}\sum_{i=1}^N f_i(x)$
Then for each update step,
$i\sim\mathrm{Unif}\{1,\dots,N\}$
is chosen uniformly at random and
$x^{(t+1)}\leftarrow x^{(t)}-\eta_t \nabla f_i(x^{(t)})$
This is a factor of $N$ cheaper to compute, but one has to live with a certain randomness in the descent direction.
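The update rule can be sketched as follows (a minimal skeleton; `grad_fi` is a placeholder name I made up for the per-term gradient, and the hyperparameters are arbitrary):

```python
import numpy as np

# Minimal SGD skeleton for f(x) = (1/N) sum_i f_i(x): at each step one term
# index is drawn uniformly and only that term's gradient is used.
def sgd(grad_fi, x0, N, steps=1000, eta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(steps):
        i = rng.integers(N)          # i ~ Unif{0, ..., N-1}
        x = x - eta * grad_fi(x, i)  # cheap stochastic update
    return x
```

For example, with $f_i(x)=\frac{1}{2}(x-c_i)^2$ the iterates hover in a small noise ball around the mean of the $c_i$.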
Here is my attempt, writing $a_k$ for the $k$-th column of $A$ and $x_k$ for the $k$-th entry of $x$, so that $Ax=\sum_{k=1}^N a_k x_k$:
$f(x) = \frac{1}{2}\|Ax-b\|_2^2 + \frac{\lambda}{2}\|x\|_2^2$
$=\frac{1}{2} (x^\top A^\top-b^\top)(Ax - b) + \frac{\lambda}{2}x^\top x$
$= \frac{1}{2} (-b^\top Ax +b^\top b + x^\top A^\top Ax - x^\top A^\top b) + \frac{\lambda}{2}\sum_{k=1}^N x_k^2$
$=\frac{1}{2} \left[-b^\top (\sum_{k=1}^N a_k x_k) +b^\top b + (\sum_{k=1}^N a_k x_k)^\top (\sum_{k=1}^N a_k x_k) - (\sum_{k=1}^N a_k x_k)^\top b\right] + \frac{\lambda}{2}\sum_{k=1}^N x_k^2$
$=\frac{1}{2} \left[\sum_{k=1}^N\frac{b^\top b}{N} + (\sum_{k=1}^N a_k x_k)^\top (\sum_{k=1}^N a_k x_k) - 2(\sum_{k=1}^N a_k x_k)^\top b\right] + \frac{\lambda}{2}\sum_{k=1}^N x_k^2$
$=\frac{1}{2} \left[\sum_{k=1}^N\frac{b^\top b}{N} + \sum_{k=1}^N x_k a_k^\top\left(\sum_{j=1}^N a_j x_j\right) - 2\left(\sum_{k=1}^N a_k x_k\right)^\top b\right] + \frac{\lambda}{2}\sum_{k=1}^N x_k^2$
$=\frac{1}{2} \left[\sum_{k=1}^N\frac{b^\top b}{N} + \sum_{k=1}^N\sum_{j=1}^N x_k x_j a_k^\top a_j - 2\sum_{k=1}^N x_k a_k^\top b\right] + \frac{\lambda}{2}\sum_{k=1}^N x_k^2$
$=\sum_{k=1}^N\left(\frac{1}{2} \left[\frac{b^\top b}{N} + (\sum_{j=1}^N x_k x_j a_k^\top a_j) - 2x_k a_k^\top b\right] + \frac{\lambda}{2}x_k^2\right)$
$=\sum_{k=1}^N\left(\frac{1}{2} \left[\frac{b^\top b}{N} + x_k a_k^\top(\sum_{j=1}^N x_ja_j) - 2x_k a_k^\top b\right] + \frac{\lambda}{2}x_k^2\right)$
$\Rightarrow f_i(x) = \frac{1}{2} \left[\frac{b^\top b}{N} + x_i a_i^\top Ax - 2x_i a_i^\top b\right] + \frac{\lambda}{2}x_i^2$
so that $f(x)=\sum_{i=1}^N f_i(x)$ exactly, i.e. without the factor $\frac{1}{N}$ (absorbing it would just rescale each $f_i$).
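The decomposition can be checked numerically (a sketch of my own; here `a_i` is the $i$-th column of $A$, matching the derivation above, and all names are made up):

```python
import numpy as np

# Check that the derived terms f_i really sum back to f(x).
rng = np.random.default_rng(1)
D, N, lam = 4, 6, 0.5
A = rng.standard_normal((D, N))
b = rng.standard_normal(D)
x = rng.standard_normal(N)

f_full = 0.5 * np.sum((A @ x - b) ** 2) + 0.5 * lam * np.sum(x ** 2)

def f_i(i):
    a_i = A[:, i]  # i-th column of A
    return (0.5 * (b @ b / N + x[i] * a_i @ (A @ x) - 2 * x[i] * a_i @ b)
            + 0.5 * lam * x[i] ** 2)

terms = [f_i(i) for i in range(N)]
assert np.isclose(sum(terms), f_full)  # the f_i indeed sum to f
```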
Now to calculate its gradient:
$\frac{\partial f_i(x)}{\partial x_j}=\begin{cases} \frac{\partial}{\partial x_j}\left(\frac{1}{2}x_ia_i^\top Ax\right) = \frac{1}{2}x_i a_i^\top a_j, & j \neq i,\\ \frac{1}{2}\left(a_i^\top Ax + x_i a_i^\top a_i -2a_i^\top b\right) + \lambda x_i, & j = i. \end{cases}$
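This case analysis can also be verified against central finite differences on a random instance (again a sketch of my own; `f_i`, the index `i`, and the instance are arbitrary):

```python
import numpy as np

# Finite-difference check of the claimed partials of f_i, where a_i is
# the i-th column of A.
rng = np.random.default_rng(2)
D, N, lam = 4, 5, 0.3
A = rng.standard_normal((D, N))
b = rng.standard_normal(D)
x = rng.standard_normal(N)
i = 2  # arbitrary term index
a_i = A[:, i]

def f_i(x):
    return (0.5 * (b @ b / N + x[i] * a_i @ (A @ x) - 2 * x[i] * a_i @ b)
            + 0.5 * lam * x[i] ** 2)

grad = 0.5 * x[i] * (A.T @ a_i)  # case j != i: (1/2) x_i a_i^T a_j
grad[i] = 0.5 * (a_i @ (A @ x) + x[i] * a_i @ a_i - 2 * a_i @ b) + lam * x[i]

eps = 1e-6
num = np.array([(f_i(x + eps * e) - f_i(x - eps * e)) / (2 * eps)
                for e in np.eye(N)])
assert np.allclose(grad, num, atol=1e-5)
```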
However, having $Ax$ appear in both $f_i(x)$ and its gradient feels a bit strange, and I have no way of verifying my results.
Would you agree with the above decomposition? If not, how would you decompose $f(x)$?