How is the derivative with respect to a vector taken in linear regression?


In the book I am studying, the author shows that the sum of squared distances of the data points to the fitted line can be written in matrix form as $$ (t-X\beta)^T(t-X\beta) $$ where $X$ is a matrix with one observation per row, $t$ is the column vector of corresponding target values, and $\beta$ is the column vector of parameters we are estimating.

So far, so good. Then, to minimize this sum, we need to take the derivative with respect to $\beta$ and set it to zero.

We get $$ X^T(t-X\beta)=0 $$ and that is too big a jump for me. I think I know basic algebra, but not matrix calculus. Can you detail the steps from the first equation to the second?

I can go this far: $$ (t-X\beta)^T(t-X\beta) $$ $$ (t^T-\beta^TX^T)(t-X\beta) $$ $$ (t^Tt-t^TX\beta-\beta^TX^Tt+\beta^TX^TX\beta) $$

But I do not know how to take the derivative of the last line w.r.t. $\beta$.

Thanks.

2 Answers

Accepted answer:

Here I list a few basic rules of matrix calculus that apply to a wide range of vector derivatives, and that can be readily proved by the same limit argument as in copper.hat's answer below. Assume $A$ is a constant matrix and $x$ is a vector; then the following hold:

$$\begin{align}\nabla_xAx&=A\\\nabla_xx^TA&=A^T\\\nabla_xx^TAx&=x^TA+(Ax)^T=x^T(A+A^T)\qquad\text{(product rule)}\end{align}$$
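The third rule can be sanity-checked numerically by comparing it against a finite-difference gradient; the matrix and vector below are arbitrary examples chosen for illustration:

```python
import numpy as np

# Numerical check of the rule  grad_x (x^T A x) = x^T (A + A^T)
# using central finite differences on example data.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)

f = lambda v: v @ A @ v   # the scalar function x^T A x
eps = 1e-6

# central-difference approximation of the gradient, one coordinate at a time
num_grad = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(4)
])

analytic = x @ (A + A.T)  # the product-rule result from above
print(np.allclose(num_grad, analytic, atol=1e-6))  # True
```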

Then for your question the deduction is straightforward: $$\begin{align}&\nabla_\beta(t^Tt-t^TX\beta-\beta^TX^Tt+\beta^TX^TX\beta)\\=&-t^TX-(X^Tt)^T+\beta^T(X^TX+(X^TX)^T)\\=&2\beta^TX^TX-2t^TX\\=&2(X\beta-t)^TX\end{align}$$ Setting this gradient to zero and transposing gives $X^T(X\beta-t)=0$, which is exactly $X^T(t-X\beta)=0$.
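One way to convince yourself of this derivation is to check it numerically on example data: the analytic gradient $2(X\beta-t)^TX$ should match a finite-difference gradient at an arbitrary $\beta$, and vanish at the least-squares solution.

```python
import numpy as np

# Check that grad phi(beta) = 2 (X beta - t)^T X for
# phi(beta) = (t - X beta)^T (t - X beta), on example data.
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))
t = rng.standard_normal(20)

phi = lambda b: (t - X @ b) @ (t - X @ b)
beta = rng.standard_normal(3)

eps = 1e-6
num_grad = np.array([
    (phi(beta + eps * e) - phi(beta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
analytic = 2 * (X @ beta - t) @ X
print(np.allclose(num_grad, analytic, atol=1e-5))  # True

# At the solution of the normal equations X^T X beta = X^T t,
# the gradient is zero.
beta_hat = np.linalg.solve(X.T @ X, X.T @ t)
print(np.allclose(2 * (X @ beta_hat - t) @ X, 0, atol=1e-8))  # True
```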

Second answer:

Let $\phi(\beta) = (t-X\beta)^T(t-X\beta)$. Now consider \begin{eqnarray} \phi(\beta+h)&=& (t-X(\beta+h))^T(t-X(\beta+h)) \\ &=& t^Tt -2 t^TX(\beta+h) +(X(\beta+h))^T X(\beta+h) \\ &=& t^Tt -2 t^TX(\beta+h) + \beta^TX^TX\beta + 2\beta^T X^TXh +h^TX^TX h \\ &=& t^Tt -2 t^TX\beta- 2 t^TXh + \beta^TX^TX\beta + 2\beta^T X^TXh +h^TX^TX h \\ &=& t^Tt -2 t^TX\beta + \beta^TX^TX\beta - 2 t^TXh + 2\beta^T X^TXh +h^TX^TX h \\ &=& \phi(\beta) + 2(X \beta-t)^TXh + h^TX^TX h \\ \end{eqnarray} Note that $|h^TX^TX h| \le \|X\|^2 \|h\|^2$ (in particular, it is $o(\|h\|)$), and the map $h \mapsto 2(X \beta-t)^TXh$ is linear in $h$, hence we obtain $D \phi(\beta) h = 2(X \beta-t)^TXh$, or we can write $D \phi(\beta) = 2(X \beta-t)^TX$.

Hence at a minimum we have $D \phi(\hat{\beta}) = 0$, which gives rise to the normal equations $(X \hat{\beta}-t)^TX = 0$, or equivalently, $X^T(X \hat{\beta} -t) = 0$.
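In practice one rarely forms $X^TX$ and solves the normal equations directly (a QR- or SVD-based solver is more numerically stable), but on well-conditioned example data the two agree; the data below are illustrative:

```python
import numpy as np

# Solve the normal equations X^T (X beta - t) = 0 directly and
# compare with NumPy's built-in least-squares solver.
rng = np.random.default_rng(2)
X = rng.standard_normal((50, 4))
t = rng.standard_normal(50)

beta_normal = np.linalg.solve(X.T @ X, X.T @ t)     # normal equations
beta_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)  # QR/SVD-based solver

print(np.allclose(beta_normal, beta_lstsq))  # True
```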