I am stuck on evaluating the closed-form solution for $\theta$: why is $(X^{T}X)^{-1}X^{T}Y$ used and not $(XX^{T})$?


$$J(\theta)=\frac{1}{n}\sum_{i=1}^{n} (y^{i}-x^{i}\theta)^2$$

where $y^{i}$ is the actual value and $x^{i}\theta$ is the predicted value for each $i$.

Differentiating with respect to $\theta$ and setting the result equal to $0$ (to find the $\theta$ that minimizes the error) gives:

$\dfrac{d(J(\theta))}{d\theta}=\dfrac{2}{n}(Y-X\theta)(-X)=0$

where $X$ and $Y$ are set of values of features and actual values respectively.

$$\Rightarrow -YX + XX\theta=0\;\Rightarrow\; XX\theta = YX\;\Rightarrow\;\theta= (XX)^{-1}YX$$

This should be the final solution. But since $X$ and $Y$ are matrices, the following is used instead:

$$\Rightarrow\theta=(X^{T}X)^{-1}X^{T}Y$$

Why can't we use $XX^{T}$ and $YX^{T}$, or take the transpose of $Y$ and use a different arrangement?

I am stuck at this. Please explain.

In a Machine Learning course on Edx by MIT a professor came with the following final solution:

$\hat{\theta} = A^{-1}b$

where $\displaystyle A=\dfrac{1}{n}\sum_{i=1}^{n} x^{i}(x^{i})^{T}$ and $\displaystyle b = \dfrac{1}{n}\sum_{i=1}^{n} y^{i}x^{i}$
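Note that $A$ and $b$ are just the summation forms of $\frac{1}{n}X^TX$ and $\frac{1}{n}X^TY$ (the rows of $X$ are the vectors $x^{i}$), so $A^{-1}b$ reproduces the matrix formula; the $\frac{1}{n}$ factors cancel. A minimal NumPy sketch, using made-up random data purely for illustration, checks this numerically:

```python
import numpy as np

# Hypothetical small data set: 5 examples, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))   # rows are the x^i
Y = rng.normal(size=(5,))     # the y^i
n = X.shape[0]

# A = (1/n) * sum_i x^i (x^i)^T  -- equals (1/n) X^T X
A = sum(np.outer(x, x) for x in X) / n
# b = (1/n) * sum_i y^i x^i      -- equals (1/n) X^T Y
b = sum(y * x for x, y in zip(X, Y)) / n

theta_sum = np.linalg.solve(A, b)              # theta-hat = A^{-1} b
theta_mat = np.linalg.solve(X.T @ X, X.T @ Y)  # (X^T X)^{-1} X^T Y
print(np.allclose(theta_sum, theta_mat))       # both give the same theta
```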

Thank You.

Best answer:

To obtain a square matrix you take the product of a matrix with itself, but you have to transpose one factor; otherwise the number of columns of the first factor does not match the number of rows of the second. If $M$ is an $m\times n$ matrix and $P=M^T$ is its transpose, then $M^TM = P_{\color{green}n\times \color{red}m}\cdot M_{\color{red}m\times \color{green}n}$ is a well-defined $n\times n$ square matrix.
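A quick NumPy check of the dimension argument (the $4\times 3$ matrix here is an arbitrary example):

```python
import numpy as np

M = np.ones((4, 3))        # m = 4 rows, n = 3 columns
print((M.T @ M).shape)     # (3, 3): always a square matrix

# M @ M is not even defined: columns of the first factor (3)
# do not match rows of the second (4), so NumPy raises an error.
try:
    M @ M
except ValueError:
    print("M @ M is not defined for a non-square M")
```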

Write the cost in matrix form (the constant factor $\frac{1}{n}$ is dropped, since it does not change the minimizer):

$J(\theta)=(Y-X\theta)^T\cdot (Y-X\theta)=(Y^{T}-\theta^{T}X^{T})\cdot (Y-X\theta)$

Multiplying out:

$J(\theta)=Y^TY-\theta ^T X ^TY-Y^TX\theta+\theta^TX^T X\theta $

$\theta ^T X ^TY$ and $Y^TX\theta$ are equal: each is a $1\times 1$ matrix (a scalar), and one is the transpose of the other. Thus

$J(\theta)=Y^TY-2\theta ^T X ^TY+\theta^TX^T X\theta$

Now you can differentiate with respect to $\theta$:

$\frac{\partial J(\theta)}{\partial \theta}=-2X^TY+2X^T X\theta=0$

I leave the rest for you.
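To sketch the remaining step: solving $-2X^TY+2X^TX\theta=0$ gives $X^TX\theta=X^TY$, hence $\theta=(X^{T}X)^{-1}X^{T}Y$. A minimal NumPy verification with made-up, noiseless data (so the known coefficients should be recovered exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
true_theta = np.array([2.0, -1.0, 0.5])
Y = X @ true_theta            # noiseless targets

# Normal equations: solve X^T X theta = X^T Y
theta = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta)                  # recovers [2.0, -1.0, 0.5]
```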