Decomposing the Euclidean norm for stochastic gradient descent


Let's say we want to minimise

$$f(x) := \frac{1}{2}\|Ax-b\|_2^2 + \frac{\lambda}{2}\|x\|_2^2$$

where $A\in\mathbb{R}^{D\times N}$, $x\in\mathbb{R}^N$, $b\in \mathbb{R}^D$, and $\lambda > 0$.

How do we write $f(x)$ as a sum of auxiliary functions, i.e. $f(x) = \frac{1}{N}\sum_{i=1}^N f_i(x)$ ?

A gradient descent algorithm would iterate

$x^{(t+1)}\leftarrow x^{(t)}-\eta_t\nabla f(x^{(t)})$

for some step size $\eta_t$.

Calculating

$\nabla f(x) = \frac{\partial}{\partial x}\left(Ax-b\right)\cdot \frac{1}{2}\frac{Ax-b}{\|Ax-b\|}\cdot 2\|Ax-b\| + \frac{\lambda}{2}2x$

$=A^\top \left( Ax -b\right) + \lambda x$

isn't a problem, but it's slow: evaluating it touches every entry of $A$, so each iteration costs $O(ND)$.
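As a sanity check, the closed-form gradient can be compared against central finite differences. A minimal sketch (the sizes, seed, and $\lambda$ below are illustrative, not from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, lam = 7, 5, 0.3                     # illustrative sizes and lambda
A = rng.standard_normal((D, N))
b = rng.standard_normal(D)
x = rng.standard_normal(N)

def f(x):
    """f(x) = 1/2 ||Ax - b||^2 + lam/2 ||x||^2"""
    return 0.5 * np.sum((A @ x - b) ** 2) + 0.5 * lam * np.sum(x ** 2)

def grad_f(x):
    """Closed form: A^T (Ax - b) + lam * x"""
    return A.T @ (A @ x - b) + lam * x

# central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(N)])
assert np.allclose(numeric, grad_f(x), atol=1e-5)
```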

Stochastic gradient descent instead assumes a decomposition

$f(x) = \frac{1}{N}\sum_{i=1}^N f_i(x)$

Then for each update step,

$i \sim \mathrm{Unif}\{1,\dots,N\}$

is chosen uniformly at random and

$x^{(t+1)}\leftarrow x^{(t)}-\eta_t \nabla f_i(x^{(t)})$

This is a factor of $N$ cheaper to compute per step, at the cost of some randomness in the descent direction.
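To make the update rule concrete, here is a minimal SGD sketch. It uses the standard *row-wise* decomposition of the least-squares term (each $f_d$ indexed by a row $a_d^\top$ of $A$, with the ridge term spread evenly), which is a different split from the coordinate-wise one I attempt below; the sizes, seed, and the constant step size are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, lam = 50, 5, 0.1                 # illustrative sizes
A = rng.standard_normal((D, N))
x_true = rng.standard_normal(N)
b = A @ x_true + 0.1 * rng.standard_normal(D)   # synthetic data

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + 0.5 * lam * np.sum(x ** 2)

# Row-wise split: f = (1/D) sum_d f_d with
#   f_d(x) = (D/2)(a_d^T x - b_d)^2 + (lam/2)||x||^2,
# so grad f_d(x) = D a_d (a_d^T x - b_d) + lam x, and E[grad f_d] = grad f.
def grad_fd(x, d):
    a_d = A[d]
    return D * a_d * (a_d @ x - b[d]) + lam * x

# unbiasedness: averaging over all d recovers the full gradient
x0 = rng.standard_normal(N)
full = A.T @ (A @ x0 - b) + lam * x0
assert np.allclose(np.mean([grad_fd(x0, d) for d in range(D)], axis=0), full)

# the SGD loop itself: each step reads a single row of A
x = np.zeros(N)
eta = 1e-3                             # small constant step, illustrative
for t in range(5000):
    d = rng.integers(D)                # index chosen uniformly at random
    x -= eta * grad_fd(x, d)

assert f(x) < f(np.zeros(N))           # made progress from the start
```

The factor $D$ in `grad_fd` is what keeps the stochastic gradient an unbiased estimate of $\nabla f$ under the $\frac{1}{D}\sum_d$ convention.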

My attempt, writing $a_k\in\mathbb{R}^D$ for the $k$-th column of $A$ (so that $Ax=\sum_{k=1}^N a_k x_k$):

$f(x) := \frac{1}{2}\|Ax-b\|_2^2 + \frac{\lambda}{2}\|x\|_2^2$

$=\frac{1}{2} (x^\top A^\top-b^\top)(Ax - b) + \frac{\lambda}{2}x^\top x$

$= \frac{1}{2} (-b^\top Ax +b^\top b + x^\top A^\top Ax - x^\top A^\top b) + \frac{\lambda}{2}\sum_{k=1}^N x_k^2$

$=\frac{1}{2} \left[-b^\top (\sum_{k=1}^N a_k x_k) +b^\top b + (\sum_{k=1}^N a_k x_k)^\top (\sum_{k=1}^N a_k x_k) - (\sum_{k=1}^N a_k x_k)^\top b\right] + \frac{\lambda}{2}\sum_{k=1}^N x_k^2$

$=\frac{1}{2} \left[\sum_{k=1}^N\frac{b^\top b}{N} + (\sum_{k=1}^N a_k x_k)^\top (\sum_{k=1}^N a_k x_k) - 2(\sum_{k=1}^N a_k x_k)^\top b\right] + \frac{\lambda}{2}\sum_{k=1}^N x_k^2$

$=\frac{1}{2} \left[\sum_{k=1}^N\frac{b^\top b}{N} + (\sum_{k=1}^N a_k^\top x_k(\sum_{j=1}^N a_j x_j)) - 2(\sum_{k=1}^N a_k x_k)^\top b\right] + \frac{\lambda}{2}\sum_{k=1}^N x_k^2$

$=\frac{1}{2} \left[\sum_{k=1}^N\frac{b^\top b}{N} + (\sum_{k=1}^N\sum_{j=1}^N x_k x_j a_k^\top a_j) - 2(\sum_{k=1}^N a_k^\top x_k) b\right] + \frac{\lambda}{2}\sum_{k=1}^N x_k^2$

$=\sum_{k=1}^N\left(\frac{1}{2} \left[\frac{b^\top b}{N} + (\sum_{j=1}^N x_k x_j a_k^\top a_j) - 2x_k a_k^\top b\right] + \frac{\lambda}{2}x_k^2\right)$

$=\sum_{k=1}^N\left(\frac{1}{2} \left[\frac{b^\top b}{N} + x_k a_k^\top(\sum_{j=1}^N x_ja_j) - 2x_k a_k^\top b\right] + \frac{\lambda}{2}x_k^2\right)$

$\Rightarrow f_i(x) = \frac{1}{2} \left[\frac{b^\top b}{N} + x_i a_i^\top Ax - 2x_i a_i^\top b\right] + \frac{\lambda}{2}x_i^2$

so that $f(x) = \sum_{i=1}^N f_i(x)$. (Note this is a plain sum; to match the $\frac{1}{N}\sum_i$ convention above, one would rescale to $N f_i$.)
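This candidate decomposition can be checked numerically: summing the $f_i$ over $i$ should reproduce $f$ exactly. A minimal sketch (sizes, seed, and $\lambda$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
D, N, lam = 7, 5, 0.3                  # illustrative sizes
A = rng.standard_normal((D, N))
b = rng.standard_normal(D)
x = rng.standard_normal(N)

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + 0.5 * lam * np.sum(x ** 2)

def f_i(x, i):
    """f_i(x) = 1/2 [b^T b / N + x_i a_i^T A x - 2 x_i a_i^T b] + lam/2 x_i^2"""
    a_i = A[:, i]                      # i-th column of A
    return 0.5 * (b @ b / N + x[i] * a_i @ (A @ x) - 2 * x[i] * a_i @ b) \
           + 0.5 * lam * x[i] ** 2

# a plain sum over i (no 1/N factor) recovers f
assert np.isclose(sum(f_i(x, i) for i in range(N)), f(x))
```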

Now to calculate its gradient:

$\frac{\partial f_i(x)}{\partial x_j}=\begin{cases} \frac{\partial}{\partial x_j}\left(\frac{1}{2}x_ia_i^\top Ax\right) = \frac{1}{2}x_i a_i^\top a_j, & j \neq i\\[4pt] \frac{1}{2}\left(a_i^\top Ax + x_i a_i^\top a_i -2a_i^\top b\right) + \lambda x_i, & j = i \end{cases}$

However, having $Ax$ appear in $f_i(x)$ and in its gradient feels strange: evaluating $a_i^\top Ax$ still requires the full product $Ax$, so a single $\nabla f_i$ is no cheaper than the full gradient. I also have no way of verifying my results.
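One way to verify the piecewise gradient numerically: compare $\nabla f_i$ against central finite differences of $f_i$, and check that the $\nabla f_i$ sum to the full gradient $A^\top(Ax-b)+\lambda x$. A sketch with illustrative sizes and seed:

```python
import numpy as np

rng = np.random.default_rng(3)
D, N, lam = 7, 5, 0.3                  # illustrative sizes
A = rng.standard_normal((D, N))
b = rng.standard_normal(D)
x = rng.standard_normal(N)

def f_i(x, i):
    a_i = A[:, i]                      # i-th column of A
    return 0.5 * (b @ b / N + x[i] * a_i @ (A @ x) - 2 * x[i] * a_i @ b) \
           + 0.5 * lam * x[i] ** 2

def grad_f_i(x, i):
    a_i = A[:, i]
    g = 0.5 * x[i] * (A.T @ a_i)       # entries j != i: (1/2) x_i a_i^T a_j
    g[i] = 0.5 * (a_i @ (A @ x) + x[i] * (a_i @ a_i) - 2 * a_i @ b) \
           + lam * x[i]
    return g

# check one piece against central finite differences
i, eps = 2, 1e-6
numeric = np.array([(f_i(x + eps * e, i) - f_i(x - eps * e, i)) / (2 * eps)
                    for e in np.eye(N)])
assert np.allclose(numeric, grad_f_i(x, i), atol=1e-5)

# the pieces sum to the full gradient A^T(Ax - b) + lam * x
full = A.T @ (A @ x - b) + lam * x
assert np.allclose(sum(grad_f_i(x, i) for i in range(N)), full)
```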

Would you agree with the above decomposition? If not, how would you decompose $f(x)$?