Taking the gradient of $f(\mathbf{x}) = \frac{1}{2}\|\mathbf{A} \mathbf{x} - \mathbf{b}\|_2^2$


Section 4.5 of the textbook *Deep Learning* by Goodfellow, Bengio, and Courville says that the gradient of

$$f(\mathbf{x}) = \dfrac{1}{2}\|\mathbf{A} \mathbf{x} - \mathbf{b}\|_2^2$$

is

$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \mathbf{A}^T (\mathbf{A}\mathbf{x} - \mathbf{b}) = \mathbf{A}^T \mathbf{A} \mathbf{x} - \mathbf{A}^T \mathbf{b}$$

My understanding is that $f(\mathbf{x}) = \dfrac{1}{2}\|\mathbf{A} \mathbf{x} - \mathbf{b}\|_2^2$ is the square of the Euclidean norm. So we have that

$$\begin{align} f(\mathbf{x}) = \dfrac{1}{2}\|\mathbf{A} \mathbf{x} - \mathbf{b}\|_2^2 &= \dfrac{1}{2} \left( \sqrt{(\mathbf{A} \mathbf{x} - \mathbf{b})^2} \right)^2 \\ &= \dfrac{1}{2} (\mathbf{A} \mathbf{x} - \mathbf{b})^2 \\ &= \dfrac{1}{2} (\mathbf{A} \mathbf{x} - \mathbf{b})(\mathbf{A} \mathbf{x} - \mathbf{b}) \\ &= \dfrac{1}{2} [ (\mathbf{A}\mathbf{x})(\mathbf{A} \mathbf{x}) - (\mathbf{A} \mathbf{x})\mathbf{b} - (\mathbf{A} \mathbf{x})\mathbf{b} + \mathbf{b}^2 ] \ \ \text{(Since matrix multiplication is distributive.)} \\ &= \dfrac{1}{2} [(\mathbf{A} \mathbf{x})^2 - 2(\mathbf{A} \mathbf{x})\mathbf{b} + \mathbf{b}^2] \ \ \text{(Note: Matrix multiplication is not commutative.)} \end{align}$$

It's at this point that I realised that, since we're working with matrices, I'm not really sure how to take the gradient of this. Taking the gradient of $f(\mathbf{x})$ with respect to $\mathbf{x}$, we get something like

$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \dfrac{1}{2} [2 (\mathbf{A} \mathbf{x}) \mathbf{A}] - \dfrac{1}{2}[2(\mathbf{A} \mathbf{A} \mathbf{x})\mathbf{b}]$$

So what is the reasoning that leads us to get $\nabla_{\mathbf{x}} f(\mathbf{x}) = \mathbf{A}^T (\mathbf{A}\mathbf{x} - \mathbf{b}) = \mathbf{A}^T \mathbf{A} \mathbf{x} - \mathbf{A}^T \mathbf{b}$? Where did the transposed matrices come from?

I would greatly appreciate it if people would please take the time to clarify this.

Best answer:

We must take the derivative with finesse, and that means we use the chain rule. Note that $f = g \circ h$, where $h(\mathbf{x}) = \mathbf{A}\mathbf{x} - \mathbf{b}$ and $g(\mathbf{u}) = \frac{1}{2}\|\mathbf{u}\|^2$. The derivatives (Jacobians) of $h$ and $g$ are $h'(\mathbf{x}) = \mathbf{A}$ and $g'(\mathbf{u}) = \mathbf{u}^T$, so by the chain rule

$$f'(\mathbf{x}) = g'(h(\mathbf{x}))\, h'(\mathbf{x}) = (\mathbf{A}\mathbf{x} - \mathbf{b})^T \mathbf{A}.$$

This is a row vector; the gradient of $f$ is its transpose, which is where the transposed matrix comes from:

$$\nabla f(\mathbf{x}) = f'(\mathbf{x})^T = \mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b}).$$
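As a sanity check on the chain-rule result, here is a small numerical sketch (with arbitrary random $\mathbf{A}$, $\mathbf{b}$, $\mathbf{x}$, chosen purely for illustration) comparing $\mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b})$ against a central finite-difference estimate of the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))  # arbitrary 5x3 matrix
b = rng.standard_normal(5)
x = rng.standard_normal(3)

def f(x):
    # f(x) = (1/2) * ||A x - b||_2^2
    r = A @ x - b
    return 0.5 * r @ r

# Closed-form gradient from the chain rule: A^T (A x - b)
analytic = A.T @ (A @ x - b)

# Central finite differences, one coordinate direction at a time
eps = 1e-6
numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(analytic, numeric, atol=1e-5))  # the two gradients agree
```

The agreement of the two vectors (up to finite-difference error) confirms that the transpose really does belong on the outer $\mathbf{A}$.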