How to differentiate with respect to a vector in this matrix expression?


I have this function of $\beta$: $$f(\beta)=\left(\textbf{y}-\textbf{X}\beta\right)^T\left(\textbf{y}-\textbf{X}\beta\right)$$

Where:

  • $\textbf{y}$ is a $N \times 1$ column vector.
  • $\textbf{X}$ is a $N \times p$ matrix.
  • Therefore $\beta$ is a $p \times 1$ column vector.

I'm asked to differentiate $f(\beta)$ with respect to $\beta$, but I've never differentiated with respect to vectors or matrices before, so I find it a bit difficult.

I looked for help in some books I have and on the internet, and found these expressions:

  • $\left(\partial/\partial_{\textbf{x}}\right)\textbf{x}^T\textbf{y}=\left(\partial/\partial_{\textbf{x}}\right)\textbf{y}^T\textbf{x}=\textbf{y}$

  • $\left(\partial/\partial_{\textbf{x}}\right)\textbf{x}^TA\textbf{y}=\left(\partial/\partial_{\textbf{x}}\right)\textbf{y}^TA^T\textbf{x}=A\textbf{y}$

But I can't seem to apply them to my case. Any help is more than appreciated; I'm still in my learning stage with mathematics.
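As a sanity check, the first identity above, $\left(\partial/\partial_{\textbf{x}}\right)\textbf{x}^T\textbf{y}=\textbf{y}$, can be verified numerically with finite differences. This is a minimal plain-Python sketch; the vectors are arbitrary made-up numbers, not part of the original question:

```python
# Central-difference check of the first identity, d/dx (x^T y) = y.
# Plain Python; x and y are arbitrary made-up vectors.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x = [0.3, -1.2, 2.0]
y = [1.5, 0.4, -0.7]

h = 1e-6
grad = []
for i in range(len(x)):
    xp, xm = x[:], x[:]
    xp[i] += h
    xm[i] -= h
    grad.append((dot(xp, y) - dot(xm, y)) / (2 * h))

print(all(abs(g - yi) < 1e-6 for g, yi in zip(grad, y)))  # True
```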


EDIT

I've tried to apply the chain rule, apparently to no avail, because my result doesn't match the final solution given by the book I took this problem from: $$\dfrac{\partial f(\beta)}{\partial \beta}=-\textbf{X}^T\left(\textbf{y}-\textbf{X}\beta\right)-\left(\textbf{y}-\textbf{X}\beta\right)^T\textbf{X}$$

3 Answers

BEST ANSWER

Just use a Taylor-series trick. Recall that Taylor's theorem tells us that $$ f(x+\partial x) = f(x) + f^\prime(x)\partial x + o(\|\partial x\|) $$ Hence, we can see that \begin{align*} f(\beta+\partial \beta) =&(y-X(\beta+\partial \beta))^T(y-X(\beta+\partial \beta))\\ =&(y-X\beta+X\partial \beta)^T(y-X\beta+X\partial \beta)\\ =&\underbrace{(y-X\beta)^T(y-X\beta)}_{f(\beta)}\underbrace{-(X\partial \beta)^T(y-X\beta) - (y-X\beta)^T(X\partial \beta)}_{f^\prime(\beta)\partial \beta}+\underbrace{(X\partial \beta)^T(X\partial\beta)}_{o(\|\partial \beta\|)}\\ \end{align*} Basically, we just expand $f(\beta + \partial \beta)$, regroup terms, and Taylor's theorem tells us which term is the derivative. This gives \begin{align*} f^\prime(\beta)\partial \beta =&-(X\partial \beta)^T(y-X\beta) - (y-X\beta)^T(X\partial \beta)\\ =&-\partial \beta^TX^T(y-X\beta) - (y-X\beta)^TX\partial \beta\\ =&-(X^T(y-X\beta))^T\partial \beta - (y-X\beta)^TX\partial \beta\\ =&-2(X^T(y-X\beta))^T\partial \beta\\ \end{align*}

From the Riesz representation theorem we get that $\langle \nabla f(\beta),\partial \beta\rangle = f^\prime(\beta)\partial \beta$ or that $\nabla f(\beta)^T\partial\beta = f^\prime(\beta)\partial \beta$. Matching terms, we get the result we want which is $$ \nabla f(\beta) = -2X^T(y-X\beta). $$
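The result $\nabla f(\beta) = -2X^T(y-X\beta)$ can be sanity-checked against a central-difference approximation of the gradient. Below is a minimal plain-Python sketch; the data and the helper names (`matvec`, `num_grad`) are made up for illustration:

```python
# Central-difference check of grad f(beta) = -2 X^T (y - X beta).
# Plain Python, no libraries; X, y, beta are arbitrary made-up data.

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def f(X, y, beta):
    r = [yi - xb for yi, xb in zip(y, matvec(X, beta))]  # residual y - X beta
    return sum(ri * ri for ri in r)

def grad(X, y, beta):
    r = [yi - xb for yi, xb in zip(y, matvec(X, beta))]
    Xt = list(map(list, zip(*X)))                        # X^T
    return [-2.0 * g for g in matvec(Xt, r)]             # -2 X^T (y - X beta)

def num_grad(X, y, beta, h=1e-6):
    g = []
    for i in range(len(beta)):
        bp, bm = beta[:], beta[:]
        bp[i] += h
        bm[i] -= h
        g.append((f(X, y, bp) - f(X, y, bm)) / (2 * h))
    return g

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # N = 3, p = 2
y = [1.0, 0.0, 2.0]
beta = [0.5, -0.25]

analytic = grad(X, y, beta)
numeric = num_grad(X, y, beta)
print(all(abs(a - b) < 1e-4 for a, b in zip(analytic, numeric)))  # True
```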

ANSWER

Expanding the expression gives us $$f(\beta) = (y-X\beta)^T(y-X\beta) = y^Ty - \beta^TX^Ty - y^TX\beta + \beta^T X^T X \beta$$ Therefore, $$\frac{\partial f}{\partial \beta} = 0-X^Ty-y^TX + \frac{\partial }{\partial \beta} \beta^T A \beta$$ where $A = X^TX$.

Note that $$\beta^T A\beta = \sum_{k=1}^p \sum_{\ell=1}^p \beta_k\beta_\ell A_{k\ell}$$ and so $$\frac{\partial}{\partial \beta_i} \beta^T A \beta = \sum_{k=1}^p \sum_{\ell=1}^p \frac{\partial}{\partial \beta_i}\beta_k\beta_\ell A_{k\ell} = \sum_{k=1}^p \sum_{\ell=1}^pA_{k\ell}\beta_\ell \frac{\partial \beta_k}{\partial \beta_i}+A_{k\ell}\beta_k\frac{\partial \beta_\ell}{\partial \beta_i}$$ $$= \sum_{k=1}^p \sum_{\ell=1}^pA_{k\ell}\beta_\ell \delta_{ki}+A_{k\ell}\beta_k\delta_{\ell i} = \sum_{k=1}^p A_{ki}\beta_k + \sum_{\ell=1}^p A_{i\ell}\beta_\ell = (\beta^TA+A\beta)_i$$

Thus, $$\frac{\partial f}{\partial \beta} = \beta^T X^TX + X^TX\beta - X^Ty-y^TX$$ which is equivalent to your book's solution.
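The component identity derived above, $\frac{\partial}{\partial \beta_i} \beta^T A \beta = (A^T\beta+A\beta)_i$, can also be checked numerically. A small plain-Python sketch with an arbitrary, deliberately non-symmetric $A$ (all names and numbers are illustrative):

```python
# Central-difference check of d/d(beta_i) beta^T A beta = (A^T beta + A beta)_i.
# A is deliberately non-symmetric; all numbers are arbitrary.

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def quad(A, b):
    # beta^T A beta written as the double sum from the answer above
    p = len(b)
    return sum(b[k] * b[l] * A[k][l] for k in range(p) for l in range(p))

A = [[1.0, 2.0], [3.0, 4.0]]
beta = [0.7, -0.3]
At = list(map(list, zip(*A)))                     # A^T

analytic = [s + t for s, t in zip(matvec(At, beta), matvec(A, beta))]

h = 1e-6
numeric = []
for i in range(len(beta)):
    bp, bm = beta[:], beta[:]
    bp[i] += h
    bm[i] -= h
    numeric.append((quad(A, bp) - quad(A, bm)) / (2 * h))

print(all(abs(a - b) < 1e-4 for a, b in zip(analytic, numeric)))  # True
```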

Also, the formulas you gave are incorrect, since the dimensions don't work out. You should have $$\frac{\partial }{\partial x} y^Tx = y^T$$ $$\frac{\partial}{\partial x} y^TA^Tx = y^TA^T$$

ANSWER

Define a new vector
$$\eqalign{ w &= X\beta-y \cr}$$ Then write the function in terms of the inner/Frobenius product (denoted by a colon) and this new variable. In this new form, finding the differential and gradient is straightforward. $$\eqalign{ f &= w:w \cr\cr df &= 2w:dw \cr &= 2w:X\,d\beta \cr &= 2X^Tw:d\beta \cr\cr \frac{\partial f}{\partial\beta} &= 2X^Tw \cr &= 2X^T(X\beta-y) \cr\cr }$$ Don't be put off by the Frobenius product; it's merely a convenient infix notation for the trace $$\eqalign{A:B={\rm tr}(A^TB)\cr}$$
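The identity $A:B={\rm tr}(A^TB)$ is easy to confirm concretely. A tiny plain-Python check with arbitrary matrices (the helper names `frob` and `trace_AtB` are made up for illustration):

```python
# Checks the Frobenius-product identity A:B = tr(A^T B) for two small
# matrices; the entries are arbitrary.

def frob(A, B):
    # A:B is the elementwise product summed over all entries
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def trace_AtB(A, B):
    # tr(A^T B): form the product A^T B, then sum its diagonal
    At = list(map(list, zip(*A)))
    P = [[sum(At[i][k] * B[k][j] for k in range(len(B)))
          for j in range(len(B[0]))] for i in range(len(At))]
    return sum(P[i][i] for i in range(min(len(P), len(P[0]))))

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[0.5, -1.0], [2.0, 0.0], [1.0, 3.0]]

print(frob(A, B) == trace_AtB(A, B))  # True
```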