$$J(\theta)=\frac{1}{n}\sum_{i=1}^{n} (y^{i}-x^{i}\theta)^2$$
where $y^{i}$ is the actual value and $x^{i}\theta$ is the predicted value for each $i$.
Differentiating with respect to $\theta$ and equating to $0$ (to find the $\theta$ that minimizes the error) gives:
$\dfrac{d(J(\theta))}{d\theta}=\dfrac{2}{n}(Y-X\theta)(-X)=0$
where $X$ and $Y$ collect the feature values and the actual values, respectively.
$$\Rightarrow -YX + XX\theta=0\;\Rightarrow\; XX\theta = YX\;\Rightarrow\;\theta= (XX)^{-1}YX$$
This should be the final solution. But since $X$ and $Y$ are matrices, the following is used instead:
$$\Rightarrow\theta=(X^{T}X)^{-1}X^{T}Y$$
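As a quick numerical sanity check (a sketch with made-up random data; the shapes and seed are arbitrary), the closed form $(X^{T}X)^{-1}X^{T}Y$ agrees with a standard least-squares solver:

```python
import numpy as np

# Hypothetical small problem: 5 samples, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
Y = rng.normal(size=(5, 1))

# Normal-equation solution: theta = (X^T X)^{-1} X^T Y
theta = np.linalg.inv(X.T @ X) @ X.T @ Y

# Compare against NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(theta, theta_lstsq))
```

(In practice one would use `np.linalg.solve` or `lstsq` rather than forming the explicit inverse, but the explicit form mirrors the formula above.)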
Why can't we use $XX^{T}$ and $YX^{T}$, or take the transpose of $Y$ and use a different arrangement?
I am stuck on this. Please explain.
In a Machine Learning course on edX by MIT, a professor arrived at the following final solution:
$\hat{\theta}$ = $A^{-1}b$
where $\displaystyle A=\dfrac{1}{n}\sum_{i=1}^{n} x^{i}(x^{i})^{T}$ and $\displaystyle b = \dfrac{1}{n}\sum_{i=1}^{n} y^{i}x^{i}$
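A small numerical check (made-up data; all names are illustrative), assuming $A$ averages the outer products $x^{i}(x^{i})^{T}$ and $b$ averages $y^{i}x^{i}$: the per-sample sums then reproduce the matrix-form solution, since $A=\frac{1}{n}X^{T}X$ and $b=\frac{1}{n}X^{T}Y$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3
X = rng.normal(size=(n, d))   # rows are the samples x^i
y = rng.normal(size=n)

# Per-sample sums: A = (1/n) sum x^i (x^i)^T, b = (1/n) sum y^i x^i
A = sum(np.outer(x, x) for x in X) / n
b = sum(yi * x for yi, x in zip(y, X)) / n

theta_hat = np.linalg.solve(A, b)

# Same answer as the matrix form (X^T X)^{-1} X^T Y.
theta_matrix = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(theta_hat, theta_matrix))
```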
Thank You.
To square a matrix expression you form a product of the matrix with itself, but you have to transpose one factor. Otherwise the number of columns of the first factor does not match the number of rows of the second: if $M$ is $m\times n$, then $M^{T}M=P_{\color{green}n\times \color{red}m}\cdot M_{\color{red}m\times \color{green}n}$ is well defined, where $P=M^{T}$.
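A minimal shape check (illustrative sizes only): multiplying a non-square $M$ by itself fails, while $M^{T}M$ is always defined:

```python
import numpy as np

M = np.ones((4, 2))   # m = 4 rows, n = 2 columns

# M @ M is undefined: (4, 2) @ (4, 2) has mismatched inner dimensions.
try:
    M @ M
except ValueError as e:
    print("M @ M fails:", e)

# M^T @ M works: (2, 4) @ (4, 2) gives an n x n result.
print((M.T @ M).shape)
```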
$J(\theta)=(Y-X\theta)^T\cdot (Y-X\theta)=(Y^{T}-\theta^{T}X^{T})\cdot (Y-X\theta)$
Multiplying out:
$J(\theta)=Y^TY-\theta ^T \mathbf{X} ^TY-Y^TX\theta+\theta^TX^T X\theta $
$\theta ^T \mathbf{X} ^TY$ and $Y^TX\theta$ are equal (each is a scalar, and one is the transpose of the other). Thus
$J(\theta)=Y^TY-2\theta ^T \mathbf{X} ^TY+\theta^TX^T X\theta$
Now you can differentiate with respect to $\theta$:
$\frac{\partial J(\theta)}{\partial \theta}=-2X^TY+2X^T X\theta=0$
I leave the rest for you.
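The remaining step can be checked numerically (a sketch with arbitrary random data): solving $2X^{T}X\theta=2X^{T}Y$ gives $\theta=(X^{T}X)^{-1}X^{T}Y$, at which the derivative above vanishes:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 2))
Y = rng.normal(size=(8, 1))

# Solve 2 X^T X theta = 2 X^T Y for theta.
theta = np.linalg.solve(X.T @ X, X.T @ Y)

# The gradient -2 X^T Y + 2 X^T X theta is (numerically) zero here.
grad = -2 * X.T @ Y + 2 * X.T @ X @ theta
print(np.allclose(grad, 0))
```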