Matrix calculus in multiple linear regression OLS estimate derivation


The steps of the following derivation are from here

Starting from $y= Xb +\epsilon $, which really is just the same as

$\begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{N} \end{bmatrix} = \begin{bmatrix} 1 & x_{21} & \cdots & x_{K1} \\ 1 & x_{22} & \cdots & x_{K2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{2N} & \cdots & x_{KN} \end{bmatrix} \begin{bmatrix} b_{1} \\ b_{2} \\ \vdots \\ b_{K} \end{bmatrix} + \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \vdots \\ \epsilon_{N} \end{bmatrix} $

it all comes down to minimizing $e'e$:

$e'e = \begin{bmatrix} e_{1} & e_{2} & \cdots & e_{N} \end{bmatrix} \begin{bmatrix} e_{1} \\ e_{2} \\ \vdots \\ e_{N} \end{bmatrix} = \sum_{i=1}^{N}e_{i}^{2} $

So minimizing $e'e$ gives us:

$\min_{b}\; e'e = (y-Xb)'(y-Xb)$

$\min_{b}\; e'e = y'y - 2b'X'y + b'X'Xb$

(*) $\frac{\partial(e'e)}{\partial b} = -2X'y + 2X'Xb \stackrel{!}{=} 0$

$X'Xb=X'y$

$b=(X'X)^{-1}X'y$
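As a numerical sanity check of the closed-form solution (an aside, not part of the original derivation, using made-up data), one can verify in NumPy that solving the normal equations $X'Xb = X'y$ reproduces the answer of a library least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 50, 3                      # N observations, K coefficients (incl. intercept)
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
b_true = np.array([1.0, 2.0, -0.5])
y = X @ b_true + 0.1 * rng.normal(size=N)

# b = (X'X)^{-1} X'y  -- solve the normal equations rather than forming the inverse
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Compare against NumPy's built-in least-squares routine
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(b_hat, b_lstsq)
```

Solving the linear system with `np.linalg.solve` is preferred in practice over explicitly computing $(X'X)^{-1}$, which is less stable numerically.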

I'm pretty new to matrix calculus, so I was a bit confused about (*).

In step (*), $\frac{\partial(y'y)}{\partial b} = 0$, which makes sense. And then $\frac{\partial(-2b'X'y)}{\partial b} = -2X'y$, but why exactly is this true? If it were $\frac{\partial(-2b'X'y)}{\partial b'}$, then that would make perfect sense to me. Is taking the partial derivative with respect to $b$ the same as taking the partial derivative with respect to $b'$?

Similarly, $\frac{\partial(b'X'Xb)}{\partial b} = 2X'Xb$. Why is this true? Shouldn't it be $2b'X'X$?
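As an aside, both identities can be checked numerically with central finite differences on made-up data. In the denominator-layout convention used in the derivation, the gradient of a scalar with respect to $b$ has the same (column) shape as $b$, which is why $X'y$ rather than $y'X$ appears:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)
b = rng.normal(size=3)
A = X.T @ X                       # symmetric matrix X'X

def num_grad(f, b, h=1e-6):
    """Central-difference gradient of a scalar function f at b."""
    g = np.zeros_like(b)
    for i in range(len(b)):
        e = np.zeros_like(b); e[i] = h
        g[i] = (f(b + e) - f(b - e)) / (2 * h)
    return g

# d(b'X'y)/db = X'y  -- same shape as b (denominator layout)
assert np.allclose(num_grad(lambda b: b @ X.T @ y, b), X.T @ y)

# d(b'Ab)/db = (A + A')b = 2Ab, since A = X'X is symmetric
assert np.allclose(num_grad(lambda b: b @ A @ b, b), 2 * A @ b)
```

The numerator-layout answer $b'X'X$ is simply the transpose of the same quantity; the two conventions differ only in whether the gradient is written as a column or a row.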

There are 2 answers below.

Best answer:

This is not exactly a proof but rather a way to think about it.

You are trying to minimize a scalar function $F(b)$. Now take the differential:

$$dF=d(y'y)-2d(b'X'y)+d(b'X'Xb)=-2db'X'y+db'X'Xb+b'X'Xdb.$$

Now transpose the last expression (which is a scalar) and factor $db'$.

$$dF=2db'(-X'y+X'Xb)$$

So the gradient of $F(b)$ is $2(-X'y+X'Xb)$. Set this to zero and solve for $b$. This procedure is sometimes also called the external definition of the gradient.
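To illustrate (an added aside with made-up data, not part of this answer), the claimed gradient $2(-X'y + X'Xb)$ of $F(b) = (y-Xb)'(y-Xb)$ can be confirmed against a central finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 4))
y = rng.normal(size=20)
b = rng.normal(size=4)

F = lambda b: (y - X @ b) @ (y - X @ b)      # scalar objective e'e
grad = 2 * (-X.T @ y + X.T @ X @ b)          # 2(-X'y + X'Xb)

# Central finite differences along each coordinate direction
h = 1e-6
g_num = np.array([(F(b + h * e) - F(b - h * e)) / (2 * h)
                  for e in np.eye(4)])
assert np.allclose(grad, g_num)
```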

Second answer:

Consider the full matrix case of the regression $$\eqalign{ Y &= XB+E \cr E &= Y-XB \cr }$$ In this case the function to be minimized is $$\eqalign{f &= \|E\|^2_F = E:E}$$ where the colon denotes the Frobenius inner product, $A:B = {\rm tr}(A^TB)$.

Now find the differential and gradient $$\eqalign{ df &= 2\,E:dE \cr &= -2\,E:X\,dB \cr &= 2\,(XB-Y):X\,dB \cr &= 2\,X^T(XB-Y):dB \cr\cr \frac{\partial f}{\partial B} &= 2\,X^T(XB-Y) \cr }$$ Set the gradient to zero and solve $$\eqalign{ X^TXB &= X^TY \cr B &= (X^TX)^{-1}X^TY \cr }$$ This result remains valid when $B$ is an $(N\times 1)$ matrix, i.e. a vector.
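A quick numerical check of the matrix-valued result (an added aside with made-up data): at $B = (X^TX)^{-1}X^TY$ the gradient $2X^T(XB-Y)$ vanishes, and the solution matches `np.linalg.lstsq`, which accepts a matrix right-hand side directly:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 4))
Y = rng.normal(size=(30, 2))                 # multiple response columns

B = np.linalg.solve(X.T @ X, X.T @ Y)        # B = (X'X)^{-1} X'Y

# The gradient 2 X'(XB - Y) vanishes at the minimizer
assert np.allclose(X.T @ (X @ B - Y), 0)

# lstsq solves each response column simultaneously
B_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(B, B_lstsq)
```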

The problem is that, in the vector case, people tend to write the function in terms of the transpose product instead of the inner product, and then fall into rabbit holes concerning the details of the transpositions.