Find the derivative of $f(\beta) = (\vec{y} -X\beta)^T(\vec{y} -X\beta)$ using the product rule

So I want to differentiate $f(\beta) = (\vec{y} -X\beta)^T(\vec{y} -X\beta)$ using the product rule. Here:

  • $\vec{y}$ is an $n \times 1$ vector
  • $X$ is an $ n \times p$ matrix
  • $\beta$ is a $p \times 1$ vector

In particular, say I just want to expand the original expression like this: $$f(\beta) = (\vec{y} -X\beta)^T(\vec{y} -X\beta) = (\vec{y}^T -\beta^TX^T)(\vec{y} -X\beta).$$

Then, I want to just apply the product rule $$\frac{\mathrm{d}f(\beta)}{\mathrm{d}\beta} =-X^T(\vec{y} -X\beta) - (\vec{y}^T -\beta^TX^T)X = $$ $$ = -X^T\vec{y} + X^TX\beta-\vec{y}^TX+\beta^TX^TX =$$ $$= -X^T\vec{y}+X^TX\beta-(X^T\vec{y})^T + (X^TX\beta)^T.$$

But this doesn't work because the first two terms have dimensions $p \times 1$, while the last two have dimensions $1 \times p$. I know I should be able to combine the terms, so what exactly goes wrong in a derivation like this?

Note that I don't want to expand the original expression further and then take the derivative; I've seen it done that way, but I really want to figure out why this doesn't work, i.e. what rule I'm missing. Also, I know I could just use the chain rule and the fact that $$\frac{\mathrm{d}}{\mathrm{d}x} x^T x = 2x^T,$$ but I still want to figure out why the product rule doesn't work in the naive way I wanted to do it above.

edit: Hmm, is it because the derivative is a function that acts on a vector in this case? So that if we call that vector $\vec{z}$, we would get

$$f'(\beta) \vec{z} = -(X\vec{z})^T(\vec{y} -X\beta) - (\vec{y}^T -\beta^TX^T)(X \vec{z}) = $$ $$ = -\vec{z}^TX^T\vec{y}+\vec{z}^TX^TX\beta-\vec{y}^TX\vec{z} + \beta^TX^TX\vec{z}.$$

But then because these are scalar quantities, we have

$$-\vec{z}^TX^T\vec{y} = (-\vec{z}^TX^T\vec{y})^T = -\vec{y}^TX\vec{z} \text{, and}$$ $$ \vec{z}^TX^TX\beta = (\vec{z}^TX^TX\beta)^T = \beta^TX^TX\vec{z}.$$

So what I wrote above would be perfectly correct; it's just that I could simplify it further this way by taking into account what the derivative is and how it acts?
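One way to sanity-check the directional-derivative reading is numerically. The sketch below is only an illustration: the values of `X`, `y`, `beta` and the direction `z` are made up, and the helpers `matvec` and `dot` are hypothetical names introduced here. It compares the two product-rule terms, evaluated as scalars, against a central finite difference of $f$ along $\vec{z}$.

```python
# Illustrative data only (n = 3, p = 2); these values are assumptions.
X = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
y = [1.0, -2.0, 0.5]
beta = [0.3, -0.7]
z = [0.4, 1.1]  # direction in which f is differentiated


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    """Inner product u^T v."""
    return sum(a * b for a, b in zip(u, v))


def f(b):
    """f(beta) = (y - X beta)^T (y - X beta)."""
    r = [yi - xi for yi, xi in zip(y, matvec(X, b))]
    return dot(r, r)


# The two product-rule terms, each a scalar once applied to z.
r = [yi - xi for yi, xi in zip(y, matvec(X, beta))]  # y - X beta
Xz = matvec(X, z)
term1 = -dot(Xz, r)  # -(X z)^T (y - X beta)
term2 = -dot(r, Xz)  # -(y - X beta)^T (X z)
product_rule = term1 + term2

# Central finite difference of f along z.
h = 1e-6
beta_plus = [b + h * d for b, d in zip(beta, z)]
beta_minus = [b - h * d for b, d in zip(beta, z)]
finite_diff = (f(beta_plus) - f(beta_minus)) / (2 * h)

print(product_rule, finite_diff)
```

Both terms are plain numbers here, so adding them raises no shape problem; the mismatch only appears when $\vec{z}$ is dropped and the two terms are read as a column and a row.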

There are 2 best solutions below

1
On

Consider a scalar function of two real vectors and calculate its differential. $$\eqalign{ f &= a^Tc \,\,= c^Ta \cr df &= a^Tdc + c^Tda \cr }$$ Now suppose you're told that $c$ is actually a function of $a$, i.e. $\,c=a.$
That's easy enough to handle. $$\eqalign{ df &= 2a^Tda\cr }$$ Now suppose you're told that $a$ itself is a function of $\beta$, i.e. $\,a=(X\beta-y)$
Again, this doesn't change things too much, since $da = X\,d\beta$. $$\eqalign{ df &= 2a^TX\,d\beta \cr }$$ Now let's collect terms into a single vector $\,g=2X^Ta,\,$ substitute it into the expression, and isolate the gradient vector. $$\eqalign{ df &= g^Td\beta \cr \frac{\partial f}{\partial\beta} &= g = 2X^T(X\beta-y) \cr }$$

The problem with your approach is that it assumes the existence of a rule $$ \frac{\partial(a^Tc)}{\partial\beta} = \Big(\frac{\partial a}{\partial\beta}\Big)^Tc + a^T\Big(\frac{\partial c}{\partial\beta}\Big) $$ which turns out to be false: the first term is $p \times 1$ while the second is $1 \times p$, so the two cannot even be added.

The correct rule is $$ \frac{\partial(a^Tc)}{\partial\beta} = \Big(\frac{\partial a}{\partial\beta}\Big)^Tc + \Big(\frac{\partial c}{\partial\beta}\Big)^Ta $$ or the transpose of this, depending on your preferred layout convention.
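As a worked instance of this rule (a sketch, taking $a = c = X\beta - y$ as above, so that $\partial a/\partial\beta = \partial c/\partial\beta = X$ when the derivative of a vector is read as its $n \times p$ Jacobian):

$$\eqalign{ \frac{\partial(a^Tc)}{\partial\beta} &= X^T(X\beta-y) + X^T(X\beta-y) \cr &= 2X^T(X\beta-y) \cr }$$

Both terms are $p \times 1$, so they add without any dimension mismatch, and the result agrees with the gradient $g$ found from the differential.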

2
On

To illustrate the problem, let's use a simple function. $$\phi = x^TAx$$ Take its differential. $$d\phi = x^TAdx + dx^TAx$$ Transpose the 2nd term (it is a scalar, so it equals its own transpose) so we can factor out the $dx$. $$d\phi = x^TAdx + x^TA^Tdx = (x^TA + x^TA^T)\,dx $$ Collect terms into a single vector $g=(Ax+A^Tx)$ and write this as $$d\phi = g^Tdx$$ Therefore $g^T$ is the gradient of this function (or $g$, depending on your preferred layout convention).


Now let's attack the problem as you proposed. Proceeding rather loosely we get $$\frac{\partial\phi}{\partial x} = x^TA\frac{\partial x}{\partial x} + \frac{\partial x^T}{\partial x}Ax $$ Once again, you need to "transpose" that 2nd term in order to factor the expression. $$\eqalign{ \frac{\partial\phi}{\partial x} &= x^TA\frac{\partial x}{\partial x} + x^TA^T\frac{\partial x}{\partial x} \cr &= \Big(x^TA + x^TA^T\Big)\frac{\partial x}{\partial x} \cr &= x^TA + x^TA^T \cr }$$ I put the word "transpose" in quotes because $$\frac{\partial x^T}{\partial x}\ne\bigg(\frac{\partial x}{\partial x}\bigg)^T$$ In fact, the term on the RHS is the identity matrix, which equals its transpose (i.e. $I^T=I$), while the term on the LHS does not exist, and this is the fatal flaw of your method.
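The formula $g = Ax + A^Tx$ can be checked numerically. The sketch below is an illustration under assumptions: the matrix `A` and the point `x` are made-up values (with `A` deliberately non-symmetric, so that $g \ne 2Ax$), and `matvec`, `dot` and `transpose` are hypothetical helper names introduced here. It compares $g$ against a central finite-difference gradient of $\phi$.

```python
# Illustrative values only; A is deliberately non-symmetric.
A = [[1.0, 2.0], [-3.0, 0.5]]
x = [0.7, -1.2]


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * e for m, e in zip(row, v)) for row in M]


def dot(u, v):
    """Inner product u^T v."""
    return sum(a * b for a, b in zip(u, v))


def transpose(M):
    """Transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*M)]


def phi(v):
    """phi(x) = x^T A x."""
    return dot(v, matvec(A, v))


# Gradient from the differential: g = A x + A^T x.
g = [a + b for a, b in zip(matvec(A, x), matvec(transpose(A), x))]

# Central finite-difference gradient, one coordinate at a time.
h = 1e-6
fd_grad = []
for i in range(len(x)):
    x_plus = list(x)
    x_minus = list(x)
    x_plus[i] += h
    x_minus[i] -= h
    fd_grad.append((phi(x_plus) - phi(x_minus)) / (2 * h))

# For comparison: 2 A x, which would only be right for symmetric A.
two_Ax = [2 * e for e in matvec(A, x)]

print(g, fd_grad, two_Ax)
```

Because $\phi$ is quadratic, the central difference matches $g$ up to rounding, while `two_Ax` differs from both for this non-symmetric `A`.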