Matrix Derivative of $(\mathbf{Y-X \beta})^T\mathbf{P}(\mathbf{Y-X \beta})$


I am trying to calculate the derivative of $$(\mathbf{Y-X \beta})^T\mathbf{P}(\mathbf{Y-X \beta}) $$ where $\mathbf{P}$ is a positive definite matrix. The actual dimensions are not given in the question, but since the expression arises when minimising over $\beta$ in regression analysis, I take $\mathbf{X}$ to be $m \times n$, $\mathbf{\beta}\in \mathbf{R}^n$ and $\mathbf{Y}\in \mathbf{R}^m$. First, I expand the expression,

$$(\mathbf{Y-X \beta})^T\mathbf{P}(\mathbf{Y-X \beta}) = (\mathbf{Y^TP-\beta^T\mathbf{X}^TP})(\mathbf{Y-X \beta}) = \mathbf{Y^TPY-Y^TPX\beta -\beta^TX^TPY+\beta^TX^TPX\beta} $$
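This expansion can be sanity-checked numerically; here is a small numpy sketch with random data (the dimensions and names below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.standard_normal((m, n))
Y = rng.standard_normal(m)
beta = rng.standard_normal(n)
A = rng.standard_normal((m, m))
P = A @ A.T + m * np.eye(m)  # symmetric positive definite

r = Y - X @ beta
lhs = r @ P @ r  # the original quadratic form
rhs = (Y @ P @ Y - Y @ P @ X @ beta
       - beta @ X.T @ P @ Y + beta @ X.T @ P @ X @ beta)
print(np.isclose(lhs, rhs))  # → True
```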

Now I take the derivative with respect to $\beta$. For the final term I use the quadratic-form rule, which assumes $\mathbf{X^TPX}$ is symmetric (it is, provided $\mathbf{P}$ is symmetric, as positive definite matrices are usually taken to be). I am just using the identities from https://en.wikipedia.org/wiki/Matrix_calculus. Anyway I get,

$$\mathbf{-Y^TPX-Y^TPX}+2\mathbf{\beta^TX^TPX} = -2\mathbf{Y^TPX+2\beta^TX^TPX}$$

From here, I can set this equal to $0$ and take the transpose to solve for $\beta$ (assuming everything is invertible for now, don't worry).

$$\mathbf{\beta^TX^TPX=Y^TPX}\iff \mathbf{X^TPX\beta=X^TPY} \iff \beta=\mathbf{(X^TPX)^{-1}X^TPY}$$
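As a numerical sanity check that this closed form really is the minimiser, here is a numpy sketch (random data, illustrative names) confirming that no small perturbation of $\beta$ decreases the objective:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 3
X = rng.standard_normal((m, n))
Y = rng.standard_normal(m)
A = rng.standard_normal((m, m))
P = A @ A.T + m * np.eye(m)  # symmetric positive definite

# beta = (X^T P X)^{-1} X^T P Y, via a linear solve
beta_hat = np.linalg.solve(X.T @ P @ X, X.T @ P @ Y)

def E(b):
    r = Y - X @ b
    return r @ P @ r

# random perturbations should never decrease the objective
for _ in range(100):
    d = 1e-3 * rng.standard_normal(n)
    assert E(beta_hat + d) >= E(beta_hat) - 1e-12
```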

The solutions solve it slightly differently. They say that since $(\mathbf{Y-X \beta})^T\mathbf{P}(\mathbf{Y-X \beta})$ is already a quadratic form, we can differentiate it directly as $$\mathbf{-X^T}\cdot 2\mathbf{P(Y-X\beta})=-2\mathbf{X^TPY} + 2{\mathbf{X^TPX\beta}}.$$ As you can see, this is the same as my derivative, but transposed (a column vector rather than a row vector). Of course, once I transpose to solve for $\beta$, the difference disappears and we get the same final solution. I have two questions.
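The two forms of the derivative can be compared numerically. In the numpy sketch below (random data, illustrative names), 1-D arrays hide the row/column distinction, so the two expressions should give the same numbers, and both should match a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 3
X = rng.standard_normal((m, n))
Y = rng.standard_normal(m)
beta = rng.standard_normal(n)
A = rng.standard_normal((m, m))
P = A @ A.T + m * np.eye(m)  # symmetric positive definite

def E(b):
    r = Y - X @ b
    return r @ P @ r

# the question's row-vector derivative and the solutions' column-vector gradient
row = -2 * Y @ P @ X + 2 * beta @ X.T @ P @ X
col = -2 * X.T @ P @ (Y - X @ beta)
assert np.allclose(row, col)  # same numbers; only the orientation differs

# central-difference check of the gradient
eps = 1e-6
fd = np.array([(E(beta + eps * np.eye(n)[i]) - E(beta - eps * np.eye(n)[i]))
               / (2 * eps) for i in range(n)])
assert np.allclose(fd, col, atol=1e-4)
```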

  1. Is the method I used incorrect? That is, if the question had simply asked me to calculate the derivative, have I done it wrongly? If so, would you kindly point out where I made my mistake?

  2. Could anyone recommend some literature or a web page that explains the approach the solutions took, i.e. taking the derivative by spotting that the expression is a quadratic form?

Thank you very much!


There are 2 best solutions below


The derivative you want is the Fréchet derivative (see https://en.wikipedia.org/wiki/Fr%C3%A9chet_derivative). Let $$ \mathbf f(\beta)=(\mathbf{Y-X\beta})^T\mathbf{P}(\mathbf{Y-X \beta}). $$ Since $\mathbf Y-\mathbf X(\beta+t\mathbf h)=(\mathbf Y-\mathbf X\beta)-t\mathbf{Xh}$, we get \begin{eqnarray} D\mathbf f(\beta)\mathbf h&=&\lim_{t\to0}\frac{\mathbf f(\beta+t\mathbf h)-\mathbf f(\beta)}{t}\\ &=&\lim_{t\to0}\frac{-t(\mathbf{Y}-\mathbf{X}\beta)^T\mathbf{PXh}-t(\mathbf{Xh})^T\mathbf{P}(\mathbf{Y}-\mathbf{X}\beta)+t^2(\mathbf{Xh})^T\mathbf{PXh}}{t}\\ &=&-(\mathbf{Y}-\mathbf{X}\beta)^T\mathbf{PXh}-(\mathbf{Xh})^T\mathbf{P}(\mathbf{Y}-\mathbf{X}\beta). \end{eqnarray}
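This directional derivative can be checked against the difference quotient from the limit definition; a small numpy sketch with random data (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 5, 3
X = rng.standard_normal((m, n))
Y = rng.standard_normal(m)
beta = rng.standard_normal(n)
h = rng.standard_normal(n)  # direction in beta-space
A = rng.standard_normal((m, m))
P = A @ A.T + m * np.eye(m)  # symmetric positive definite

def f(b):
    r = Y - X @ b
    return r @ P @ r

r = Y - X @ beta
# D f(beta) h = -(Y - X beta)^T P X h - (X h)^T P (Y - X beta)
Df_h = -(r @ P @ X @ h) - ((X @ h) @ P @ r)

t = 1e-7
numeric = (f(beta + t * h) - f(beta)) / t
assert np.isclose(Df_h, numeric, atol=1e-4)
```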


$\def\d{\cdot}\def\p#1#2{\frac{\partial #1}{\partial #2}}$ The use of an explicit dot product often prevents transposition errors such as the one that you encountered, and reducing visual clutter minimizes distractions during the differentiation. Towards that end, define the working vector $$\eqalign{ w &= X\d b-y \\ }$$ Write the regression error in terms of this new vector, calculate the gradient, then substitute the original variables. $$\eqalign{ {\cal E} &= w\d P\d w \\ d{\cal E} &= 2w\d P\d dw = 2w\d P\d X\d db = 2\big(X^T\d P\d w\big)\d db \\ \p{\cal E}{b} &= 2X^T\d P\d w = 2X^T\d P\d \big(X\d b-y\big) \\ }$$ NB:   The $P$ matrix has been assumed to be symmetric; if that's not the case, it should be replaced by its symmetric part, i.e. $\;P\to\tfrac 12\left(P+P^T\right).$
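The symmetrisation remark in the NB can be verified numerically: for a deliberately non-symmetric $P$, the gradient built from the symmetric part $\tfrac12(P+P^T)$ still matches a finite-difference estimate (a numpy sketch with random data, names illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 5, 3
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
b = rng.standard_normal(n)
P = rng.standard_normal((m, m))  # deliberately NOT symmetric

def E(b_):
    w = X @ b_ - y
    return w @ P @ w

# gradient for general P uses the symmetric part (P + P^T)/2
Ps = 0.5 * (P + P.T)
grad = 2 * X.T @ Ps @ (X @ b - y)

eps = 1e-6
fd = np.array([(E(b + eps * np.eye(n)[i]) - E(b - eps * np.eye(n)[i]))
               / (2 * eps) for i in range(n)])
assert np.allclose(fd, grad, atol=1e-4)
```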

Now you can proceed as usual: set the gradient to zero and solve for the optimal $b$ vector. $$\eqalign{ P &= L\d L^T &\qquad\big({\rm Cholesky\,factorization}\big) \\ R &= L^T\d X \\ R^T\d R\d b &= R^T\d L^T\d y \\ b &= R^+\d L^T\d y &\qquad\big(R^+{\rm \,is\,the\,Moore\,Penrose\,inverse}\big) \\ }$$
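A numpy sketch of this pipeline (random data; assuming $X$ has full column rank so the normal equations have a unique solution), checked against the direct normal-equation solve:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 6, 3
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
A = rng.standard_normal((m, m))
P = A @ A.T + m * np.eye(m)  # symmetric positive definite

L = np.linalg.cholesky(P)         # Cholesky factorization: P = L L^T
R = L.T @ X
b = np.linalg.pinv(R) @ (L.T @ y) # b = R^+ L^T y (Moore-Penrose inverse)

# agrees with solving X^T P X b = X^T P y directly
b_direct = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)
assert np.allclose(b, b_direct)
```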