I have a function:
$f(\textbf{x}) = \frac{1}{2} \|\textbf{Ax} - \textbf{b}\|_2^2$
I am trying to minimize this function over $\textbf{x}$ using gradient-based optimization. The textbook I am following suggests the gradient can be computed as:
$\nabla_\textbf{x}f(\textbf{x}) = \textbf{A}^T(\textbf{Ax} - \textbf{b})$.
I am not understanding what has happened here. I am familiar with the notion of a gradient and with how the function itself works, but I do not see why this computation yields the gradient.
In $$f(\textbf{x}) = \frac{1}{2} \|\textbf{Ax} - \textbf{b}\|_2^2$$ you have a vector $\textbf v=\textbf{Ax} - \textbf{b}$. The squared Euclidean norm is the dot product of $\textbf v$ with itself, which in matrix form is $\|\textbf v\|_2^2 = \textbf v^T\textbf v$, so $f(\textbf x) = \frac{1}{2}\textbf v^T\textbf v$.
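Writing the norm this way, you can also expand the quadratic first and differentiate term by term, which arrives at the same gradient:
$$f(\textbf x) = \frac{1}{2}(\textbf{Ax}-\textbf b)^T(\textbf{Ax}-\textbf b) = \frac{1}{2}\textbf x^T\textbf A^T\textbf{Ax} - \textbf b^T\textbf{Ax} + \frac{1}{2}\textbf b^T\textbf b.$$
Using the standard identities $\nabla_\textbf x\left(\textbf x^T\textbf M\textbf x\right) = (\textbf M + \textbf M^T)\textbf x$ and $\nabla_\textbf x\left(\textbf c^T\textbf x\right) = \textbf c$ (here $\textbf M = \textbf A^T\textbf A$ is symmetric and $\textbf c = \textbf A^T\textbf b$):
$$\nabla_\textbf x f(\textbf x) = \textbf A^T\textbf{Ax} - \textbf A^T\textbf b = \textbf A^T(\textbf{Ax}-\textbf b).$$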
Now take the gradient with respect to $\textbf x$ using the chain rule. The Jacobian of $\textbf v = \textbf{Ax}-\textbf b$ is $\frac{d\textbf v}{d\textbf x} = \textbf A$, and the gradient of $\frac{1}{2}\textbf v^T\textbf v$ with respect to $\textbf v$ is $\textbf v$ (the factor $\frac{1}{2}$ cancels the $2$ from differentiating the square), so
$$\nabla_\textbf{x}f(\textbf{x}) = \left(\frac{d\textbf v}{d\textbf x}\right)^T \nabla_\textbf{v}\left(\tfrac{1}{2}\textbf v^T\textbf v\right) = \textbf A^T\textbf v = \textbf{A}^T(\textbf{Ax} - \textbf{b}).$$
The transpose on $\textbf A$ is exactly where the chain rule for vector-valued functions puts the Jacobian; equivalently, it comes from the property of the transpose of a matrix-vector product, $(\textbf{Ax})^T = \textbf x^T\textbf A^T$.
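If you want to convince yourself numerically, here is a quick sanity check (a sketch using NumPy; the matrix sizes and random data are arbitrary choices) that compares the closed-form gradient $\textbf A^T(\textbf{Ax}-\textbf b)$ against central finite differences:

```python
import numpy as np

# Arbitrary problem data for illustration.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

def f(x):
    # f(x) = 0.5 * ||Ax - b||_2^2
    r = A @ x - b
    return 0.5 * r @ r

# Closed-form gradient from the derivation above.
analytic = A.T @ (A @ x - b)

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(x)
for i in range(len(x)):
    e = np.zeros_like(x)
    e[i] = eps
    numeric[i] = (f(x + e) - f(x - e)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

Central differences have $O(\varepsilon^2)$ truncation error, so the two gradients agree to many digits and the check passes comfortably.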