Derivative transpose (follow up)


I'm working through The Elements of Statistical Learning, and I have a quick follow-up to the question below:

derivative transpose

The accepted answer states the following: $$ \frac{\partial}{\partial\beta}(\beta^T X^TX\beta) = 2X^TX\beta $$

For context, $X$ is an $n\times p$ matrix, and $\beta$ is a $p \times 1$ matrix.

I'm a bit confused about that. When I apply the product rule, I get the following: $$ \frac{\partial}{\partial\beta}(\beta^T X^TX\beta) = (\beta^TX^T)'(X\beta)+(\beta^TX^T)(X\beta)' = X^TX\beta+ \beta^TX^TX $$

My issue now is that $X^TX\beta$ is a $p\times 1$ matrix, and $\beta^TX^TX$ is a $1\times p$ matrix. I mean, I guess it doesn't matter since they're both vectors and each other's transposes, but this seems shaky from a dimensionality perspective. Am I missing something?

Thanks!


There are 2 best solutions below

On BEST ANSWER

This is a classic example of why it is a bad idea to differentiate with respect to vectors! Do it componentwise, summing over the repeated indices $i$, $j$, $k$ (here the subscript $p$ labels a single component of $\beta$, not the dimension), and you get the following. $$\frac{\partial}{\partial\beta_p}(\beta_iX_{ki} X_{kj}\beta_j)=X_{kp}X_{kj}\beta_j+\beta_iX_{ki}X_{kp}=(2X^TX\beta)_p$$ It is conventional for the derivative of a scalar with respect to a column vector to be a column vector, so we must have that $$\frac{\partial}{\partial\beta}(\beta^TX^TX\beta)=2X^TX\beta$$ Note that $(2X^TX\beta)_p=(2\beta^TX^TX)_p$, so if we wanted our derivative to be a row vector, it would be $2\beta^TX^TX$, which has the same components.
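As a sanity check on the componentwise result, here is a minimal sketch that compares $2X^TX\beta$ against central finite differences taken one component of $\beta$ at a time. The sizes `n` and `p`, the random data, and the helper names `quad_form`, `analytic_gradient`, and `numeric_gradient` are all illustrative assumptions, not part of the original question.

```python
import random

def quad_form(X, beta):
    """F(beta) = beta^T X^T X beta = ||X beta||^2, with X given as a list of rows."""
    Xb = [sum(x * b for x, b in zip(row, beta)) for row in X]
    return sum(r * r for r in Xb)

def analytic_gradient(X, beta):
    """Components of 2 X^T X beta: entry m is 2 * sum_k X[k][m] * (X beta)[k]."""
    Xb = [sum(x * b for x, b in zip(row, beta)) for row in X]
    return [2.0 * sum(X[k][m] * Xb[k] for k in range(len(X)))
            for m in range(len(beta))]

def numeric_gradient(X, beta, h=1e-6):
    """Central differences, perturbing one component of beta at a time."""
    grad = []
    for m in range(len(beta)):
        up = list(beta)
        up[m] += h
        down = list(beta)
        down[m] -= h
        grad.append((quad_form(X, up) - quad_form(X, down)) / (2 * h))
    return grad

random.seed(0)
n, p = 5, 3  # arbitrary illustrative sizes
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
beta = [random.gauss(0, 1) for _ in range(p)]

g_exact = analytic_gradient(X, beta)
g_num = numeric_gradient(X, beta)
max_gap = max(abs(a - b) for a, b in zip(g_exact, g_num))
print(max_gap)  # expected to be tiny, limited only by rounding error
```

The gradient comes back as a plain list of $p$ numbers. Whether one writes it as a column or a row is purely a layout convention, which is the point made above.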


See also: this recent answer of mine to a very similar question.

On

Let $F(\beta)=\beta^TX^TX\beta$. You want to compute $\frac{\partial F}{\partial \beta}$. This can be found by considering a small $p\times 1$ vector $\varepsilon$ and looking at $$ F(\beta+\varepsilon)-F(\beta)=\varepsilon^TX^TX\beta+\beta^TX^TX\varepsilon +\require{cancel}\cancelto{o(\|\varepsilon\|)}{\varepsilon^TX^TX\varepsilon}\approx\varepsilon^TX^TX\beta+\beta^TX^TX\varepsilon=2\varepsilon^TX^TX\beta $$ The last equality follows since $\beta^TX^TX\varepsilon$ is a $1\times 1$ matrix, and therefore equal to its transpose $\varepsilon^TX^TX\beta$.

This shows that the difference $F(\beta+\varepsilon)-F(\beta)$ is well approximated by $\varepsilon^T$ times $2X^TX\beta$, so that $2X^TX\beta$ is the derivative of $F$.
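The first-order claim can also be checked numerically. The sketch below subtracts the linear term $\varepsilon^T(2X^TX\beta)$ from $F(\beta+\varepsilon)-F(\beta)$ and records what is left, which should be the quadratic term $\varepsilon^TX^TX\varepsilon$ and so should shrink by roughly a factor of 100 each time $\varepsilon$ shrinks by a factor of 10. The data, the sizes, and the helper names `matvec`, `F`, and `gradient` are assumptions made for illustration.

```python
import random

def matvec(X, v):
    """Matrix-vector product, with X given as a list of rows."""
    return [sum(x * vi for x, vi in zip(row, v)) for row in X]

def F(X, beta):
    """F(beta) = beta^T X^T X beta = ||X beta||^2."""
    return sum(r * r for r in matvec(X, beta))

def gradient(X, beta):
    """2 X^T X beta, the candidate derivative of F."""
    Xb = matvec(X, beta)
    return [2.0 * sum(X[k][j] * Xb[k] for k in range(len(X)))
            for j in range(len(beta))]

random.seed(1)
n, p = 6, 4  # arbitrary illustrative sizes
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
beta = [random.gauss(0, 1) for _ in range(p)]
direction = [random.gauss(0, 1) for _ in range(p)]

g = gradient(X, beta)
remainders = []
for scale in (1e-1, 1e-2, 1e-3):
    eps = [scale * d for d in direction]
    shifted = [b + e for b, e in zip(beta, eps)]
    linear_part = sum(e * gi for e, gi in zip(eps, g))  # eps^T (2 X^T X beta)
    remainder = F(X, shifted) - F(X, beta) - linear_part
    remainders.append(remainder)
    print(scale, remainder)  # remainder should scale like scale**2
```

Because the leftover term is quadratic in $\varepsilon$, it vanishes faster than $\|\varepsilon\|$, which is what justifies calling $2X^TX\beta$ the derivative.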