I'm working through *The Elements of Statistical Learning*, and I have a quick follow-up to the question below:
The accepted answer states the following: $$ \frac{\partial}{\partial\beta}(\beta^T X^TX\beta) = 2X^TX\beta $$
For context, $X$ is an $n\times p$ matrix, and $\beta$ is a $p \times 1$ column vector.
I'm a bit confused about that. When I apply the product rule, I get the following: $$ \frac{\partial}{\partial\beta}(\beta^T X^TX\beta) = (\beta^TX^T)'(X\beta)+(\beta^TX^T)(X\beta)' = X^TX\beta+ \beta^TX^TX $$
My issue is that $X^TX\beta$ is a $p\times 1$ matrix, while $\beta^TX^TX$ is a $1\times p$ matrix. I realize they're both vectors and each other's transposes, so perhaps it doesn't matter, but summing them seems shaky from a dimensionality perspective. Am I missing something?
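To make the shape mismatch concrete, here's a quick numerical sketch (using NumPy with made-up dimensions) showing that the two terms have different shapes but are each other's transposes:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 4, 2
X = rng.standard_normal((n, p))
beta = rng.standard_normal((p, 1))  # p x 1 column vector

a = X.T @ X @ beta    # shape (p, 1) -- a column vector
b = beta.T @ X.T @ X  # shape (1, p) -- a row vector

print(a.shape, b.shape)     # (2, 1) (1, 2)
print(np.allclose(a, b.T))  # True: same entries, transposed layout
```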
Thanks!
This is a classic example of why it is a bad idea to differentiate with respect to vectors! Do it componentwise (with summation over repeated indices implied) and you get the following:
$$\frac{\partial}{\partial\beta_p}(\beta_iX_{ki} X_{kj}\beta_j)=X_{kp}X_{kj}\beta_j+\beta_iX_{ki}X_{kp}=(2X^TX\beta)_p$$
Here the first term comes from $\partial\beta_i/\partial\beta_p=\delta_{ip}$ and the second from $\partial\beta_j/\partial\beta_p=\delta_{jp}$. It is conventional for the derivative of a scalar with respect to a column vector to be a column vector, so we must have
$$\frac{\partial}{\partial\beta}(\beta^TX^TX\beta)=2X^TX\beta$$
Note that $(2X^TX\beta)_p=(2\beta^TX^TX)_p$, so if we wanted our derivative to be a row vector, this would also be true.
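You can sanity-check the closed form numerically (a sketch with NumPy and arbitrary dimensions): a central finite-difference gradient of the scalar $\beta^TX^TX\beta$ should match $2X^TX\beta$ componentwise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)

# The scalar objective beta^T X^T X beta.
f = lambda b: b @ X.T @ X @ b

# Central finite differences approximate each partial derivative.
eps = 1e-6
grad_fd = np.array([
    (f(beta + eps * e) - f(beta - eps * e)) / (2 * eps)
    for e in np.eye(p)
])

grad_closed = 2 * X.T @ X @ beta
print(np.allclose(grad_fd, grad_closed, atol=1e-5))  # True
```

Since the objective is quadratic, central differences are exact up to floating-point rounding, so the agreement is tight.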
See also: this recent answer of mine to a very similar question.