So I want to differentiate $f(\beta) = (\vec{y} -X\beta)^T(\vec{y} -X\beta)$ using the product rule. Here:
- $\vec{y}$ is an $n \times 1$ vector
- $X$ is an $ n \times p$ matrix
- $\beta$ is a $p \times 1$ vector
In particular, say I just want to expand the original expression like this: $$f(\beta) = (\vec{y} -X\beta)^T(\vec{y} -X\beta) = (\vec{y}^T -\beta^TX^T)(\vec{y} -X\beta).$$
Then, I want to just apply the product rule $$\frac{\mathrm{d}f(\beta)}{\mathrm{d}\beta} =-X^T(\vec{y} -X\beta) - (\vec{y}^T -\beta^TX^T)X = $$ $$ = -X^T\vec{y} + X^TX\beta-\vec{y}^TX+\beta^TX^TX =$$ $$= -X^T\vec{y}+X^TX\beta-(X^T\vec{y})^T + (X^TX\beta)^T.$$
But this doesn't work because the first two terms have dimensions $p \times 1$, while the last two have dimensions $1 \times p$. I know I should be able to combine the terms, so what exactly goes wrong in a derivation like this?
Note that I don't want to expand the original expression further and then take the derivative; I've seen it done that way, but I really want to figure out why this doesn't work, i.e. what rule I'm missing. Also, I know I could just use the chain rule and the fact that $$\frac{\mathrm{d}}{\mathrm{d}x} x^T x = 2x^T,$$ but I still want to figure out why the product rule doesn't work in the naive way I wanted to do it above.
edit: Hmm, is it because the derivative is a function that acts on a vector in this case? So that if we call that vector $\vec{z}$, we would get
$$f'(\beta) \vec{z} = -(X\vec{z})^T(\vec{y} -X\beta) - (\vec{y}^T -\beta^TX^T)(X \vec{z}) = $$ $$ = -\vec{z}^TX^T\vec{y}+\vec{z}^TX^TX\beta-\vec{y}^TX\vec{z} + \beta^TX^TX\vec{z}.$$
But then because these are scalar quantities, we have
$$-\vec{z}^TX^T\vec{y} = (-\vec{z}^TX^T\vec{y})^T = -\vec{y}^TX\vec{z} \text{, and}$$ $$ \vec{z}^TX^TX\beta = (\vec{z}^TX^TX\beta)^T = \beta^TX^TX\vec{z}.$$
So what I wrote above would be perfectly correct, it's just that I could simplify it further this way by taking into account what the derivative is and how it acts?
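Carrying that simplification through, as a sketch that uses only the two scalar identities above, the four terms would collapse pairwise:

$$f'(\beta)\vec{z} = -2\vec{y}^TX\vec{z} + 2\beta^TX^TX\vec{z} = -2(\vec{y} -X\beta)^TX\vec{z}.$$

This would agree with the chain-rule route: applying $\frac{\mathrm{d}}{\mathrm{d}x} x^T x = 2x^T$ with $x = \vec{y} - X\beta$ and $\frac{\mathrm{d}x}{\mathrm{d}\beta} = -X$ gives $-2(\vec{y} -X\beta)^TX$.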
Consider a scalar function of two real vectors and calculate its differential. $$\eqalign{ f &= a^Tc \,\,= c^Ta \cr df &= a^Tdc + c^Tda \cr }$$ Now suppose you're told that $c$ is actually a function of $a$, i.e. $\,c=a.$
That's easy enough to handle. $$\eqalign{ df &= 2a^Tda\cr }$$ Now suppose you're told that $a$ itself is a function of $\beta$, i.e. $\,a=(X\beta-y).$
Again, this doesn't change things too much. $$\eqalign{ df &= 2a^TX\,d\beta \cr }$$ Now let's collect terms into a single vector $\,g=2X^Ta,\,$ substitute it into the expression, and isolate the gradient vector. $$\eqalign{ df &= g^Td\beta \cr \frac{\partial f}{\partial\beta} &= g = 2X^T(X\beta-y) \cr }$$

The problem with your approach is that it assumes the existence of a rule $$ \frac{\partial(a^Tc)}{\partial\beta} = \Big(\frac{\partial a}{\partial\beta}\Big)^Tc + a^T\Big(\frac{\partial c}{\partial\beta}\Big) $$ which turns out to be false: with $n \times p$ Jacobians, the first term is a $p \times 1$ column while the second is a $1 \times p$ row, so the two cannot be added.
The correct rule is $$ \frac{\partial(a^Tc)}{\partial\beta} = \Big(\frac{\partial a}{\partial\beta}\Big)^Tc + \Big(\frac{\partial c}{\partial\beta}\Big)^Ta $$ or the transpose of this, depending on your preferred layout convention.
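As a quick sanity check, here is a minimal NumPy sketch comparing the corrected rule against a central finite-difference gradient. The sizes `n = 6` and `p = 3`, the random data, and the names `f`, `J`, `fd_grad`, `correct`, and `naive` are all illustrative assumptions, not part of the question. It uses $a = c = y - X\beta$, so both Jacobians equal $-X$.

```python
import numpy as np

# Illustrative sizes and random data (assumptions, not from the question).
rng = np.random.default_rng(0)
n, p = 6, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)

def f(b):
    """Sum of squared residuals (y - X b)^T (y - X b)."""
    r = y - X @ b
    return r @ r

# Central finite-difference gradient, one coordinate direction at a time.
eps = 1e-6
fd_grad = np.array([(f(beta + eps * e) - f(beta - eps * e)) / (2 * eps)
                    for e in np.eye(p)])

# Corrected rule with a = c = y - X beta and Jacobian da/dbeta = -X (n x p):
# gradient = (da/dbeta)^T c + (dc/dbeta)^T a, both terms of shape (p,).
a = y - X @ beta
J = -X
correct = J.T @ a + J.T @ a

# Closed form from the answer: 2 X^T (X beta - y).
closed_form = 2 * X.T @ (X @ beta - y)

# Naive rule with explicit column vectors: a (p, 1) column plus a (1, p) row.
# NumPy silently broadcasts the sum to a (p, p) matrix, which is not a gradient.
a_col = a.reshape(-1, 1)
naive = J.T @ a_col + a_col.T @ J

print(np.allclose(fd_grad, correct, atol=1e-4))
print(np.allclose(correct, closed_form))
print(naive.shape)
```

The shape of `naive` is the numerical counterpart of the dimension mismatch in the question: the two terms live in different spaces, and only after transposing the second one do they add up to the gradient.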