I have this function of $\beta$: $$f(\beta)=\left(\textbf{y}-\textbf{X}\beta\right)^T\left(\textbf{y}-\textbf{X}\beta\right)$$
Where:
- $\textbf{y}$ is a $N \times 1$ column vector.
- $\textbf{X}$ is a $N \times p$ matrix.
- Therefore $\beta$ is a $p \times 1$ column vector.
I'm asked to differentiate $f(\beta)$ with respect to $\beta$, but I've never differentiated expressions involving matrices before, and I'm finding it difficult.
I looked for help in some books I have and on the internet, and found these expressions:
$\left(\partial/\partial_{\textbf{x}}\right)\textbf{x}^T\textbf{y}=\left(\partial/\partial_{\textbf{x}}\right)\textbf{y}^T\textbf{x}=\textbf{y}$
$\left(\partial/\partial_{\textbf{x}}\right)\textbf{x}^TA\textbf{y}=\left(\partial/\partial_{\textbf{x}}\right)\textbf{y}^TA^T\textbf{x}=A\textbf{y}$
But I can't seem to apply them to my case. Any help is more than appreciated; I'm still at the learning stage with mathematics.
EDIT
I've tried to apply the chain rule, but my result doesn't match the final solution given by the book I took this problem from: $$\dfrac{\partial f(\beta)}{\partial \beta}=-\textbf{X}^T\left(\textbf{y}-\textbf{X}\beta\right)-\left(\textbf{y}-\textbf{X}\beta\right)^T\textbf{X}$$
Just play a Taylor series trick. Recall that Taylor's theorem tells us that $$ f(x+\partial x) = f(x) + f^\prime(x)\partial x + o(\|\partial x\|) $$ where $\partial x$ denotes a small perturbation of $x$. Hence, we can see that \begin{align*} f(\beta+\partial \beta) =&(y-X(\beta+\partial \beta))^T(y-X(\beta+\partial \beta))\\ =&(y-X\beta-X\partial \beta)^T(y-X\beta-X\partial \beta)\\ =&\underbrace{(y-X\beta)^T(y-X\beta)}_{f(\beta)}\underbrace{-(X\partial \beta)^T(y-X\beta) - (y-X\beta)^T(X\partial \beta)}_{f^\prime(\beta)\partial \beta}+\underbrace{(X\partial \beta)^T(X\partial\beta)}_{o(\|\partial \beta\|)}\\ \end{align*}
Basically, we just expand $f(\beta + \partial \beta)$, regroup terms, and let Taylor's theorem tell us which term is the derivative: it is the part that is linear in $\partial\beta$. The last term is quadratic in $\partial\beta$, so it is $o(\|\partial\beta\|)$. This gives \begin{align*} f^\prime(\beta)\partial \beta =&-(X\partial \beta)^T(y-X\beta) - (y-X\beta)^T(X\partial \beta)\\ =&-\partial \beta^TX^T(y-X\beta) - (y-X\beta)^TX\partial \beta\\ =&-(X^T(y-X\beta))^T\partial \beta - (y-X\beta)^TX\partial \beta\\ =&-2(X^T(y-X\beta))^T\partial \beta\\ \end{align*}
The third line uses the fact that a scalar equals its own transpose, and the fourth uses $(y-X\beta)^TX = (X^T(y-X\beta))^T$.
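From the Riesz representation theorem we get that $\langle \nabla f(\beta),\partial \beta\rangle = f^\prime(\beta)\partial \beta$, or equivalently that $\nabla f(\beta)^T\partial\beta = f^\prime(\beta)\partial \beta$. Matching terms, we get the result we want, which is $$ \nabla f(\beta) = -2X^T(y-X\beta). $$ Note that the two terms in the expression quoted in the question are transposes of each other: $-X^T(y-X\beta)$ is $p \times 1$ and $-(y-X\beta)^TX$ is $1 \times p$. Written consistently as column vectors, they add up to the gradient above.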
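If you want to convince yourself numerically, here is a minimal sketch that compares the closed form $-2X^T(y-X\beta)$ against central finite differences on random data. The function names (`residual`, `f`, `grad_f`, `numerical_grad`), the sizes $N=8$, $p=3$, and the step `h` are my own choices for illustration; they are not from the book. It uses plain Python lists so that it runs without extra packages.

```python
import random

def residual(beta, X, y):
    # r = y - X beta, with X stored as a list of N rows, each of length p
    return [yi - sum(xij * bj for xij, bj in zip(row, beta))
            for row, yi in zip(X, y)]

def f(beta, X, y):
    # f(beta) = (y - X beta)^T (y - X beta)
    return sum(ri * ri for ri in residual(beta, X, y))

def grad_f(beta, X, y):
    # Closed form from the derivation: -2 X^T (y - X beta)
    r = residual(beta, X, y)
    return [-2.0 * sum(row[j] * ri for row, ri in zip(X, r))
            for j in range(len(beta))]

def numerical_grad(beta, X, y, h=1e-5):
    # Central differences, perturbing one coordinate of beta at a time
    g = []
    for j in range(len(beta)):
        up = list(beta)
        up[j] += h
        down = list(beta)
        down[j] -= h
        g.append((f(up, X, y) - f(down, X, y)) / (2 * h))
    return g

# Arbitrary random test problem (assumed sizes, fixed seed for repeatability)
random.seed(0)
N, p = 8, 3
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(N)]
y = [random.gauss(0, 1) for _ in range(N)]
beta = [random.gauss(0, 1) for _ in range(p)]

analytic = grad_f(beta, X, y)
numeric = numerical_grad(beta, X, y)

# Largest componentwise disagreement between the two gradients
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff < 1e-6)
```

Because $f$ is quadratic in $\beta$, central differences carry no truncation error here, so any disagreement should be at the level of floating-point rounding.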