I am learning linear regression and came across the following:
We need to find weights $w$ that minimize the error function. So, $$w^*=\arg \min_w{E(w,\mathcal{D})}=\arg \min_w\sum_{i=1}^n(y_i-w^\top x_i)^2,$$ where $(y_i-w^\top x_i)^2$ is the squared error for sample $i$, $\mathcal{D}$ is the training data, and $n$ is the number of samples.
Solve for $w$ by setting $\nabla_wE=\nabla_w\sum_{i=1}^n(y_i-w^\top x_i)^2=0$:
$$\nabla_wE=-2\sum_iy_ix_i+2\color{blue}{\sum_i(w^Tx_i)x_i}=0$$
$$\nabla_wE=-2X^Ty+2\color{blue}{X^TXw}=0\Longrightarrow w^*=\color{red}{(X^TX)^{-1}X^Ty}$$
Q1. I am not able to understand why the two blue expressions are the same. (The first is in summation form, whereas the second treats the whole set of $x_i$'s as a matrix $X$.)
Q2. I also do not understand how the red matrix expression is obtained.
$X$ is a matrix whose $i$th row is $x_i^\top$.
The sum $\sum_i (w^\top x_i) x_i$ is a linear combination of vectors $x_i$ with coefficients $(w^\top x_i)$; such a linear combination can be written as $X^\top v$ (note that the columns of $X^\top$ are $x_1, \ldots, x_n$) where $v$ is a vector whose $i$th entry is $w^\top x_i$. (If this is not clear to you, think about how in general the matrix multiplication $Av$, where $A$ is a matrix and $v$ is a vector, is a linear combination of the columns of $A$.)
Finally, the vector $v$ can be written as $Xw$; just check that the $i$th entry of $Xw$ is precisely $x_i^\top w$.
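The identity $\sum_i (w^\top x_i)\,x_i = X^\top X w$ can also be checked numerically. A minimal sketch with NumPy, using random data (the names `X`, `w`, `sum_form`, `matrix_form` are mine, not from the original):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.standard_normal((n, d))  # i-th row of X is x_i^T
w = rng.standard_normal(d)

# Summation form: sum_i (w^T x_i) x_i, iterating over the rows x_i of X
sum_form = sum((w @ x_i) * x_i for x_i in X)

# Matrix form: X^T X w
matrix_form = X.T @ X @ w

print(np.allclose(sum_form, matrix_form))  # True
```

Both forms produce the same $d$-dimensional vector, which is exactly the claim above.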
The last equation can be rearranged to $X^\top y = X^\top X w$. Multiplying both sides on the left by $(X^\top X)^{-1}$ yields the red equation.
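The closed-form solution $w^* = (X^\top X)^{-1}X^\top y$ can likewise be verified against a standard least-squares solver. A small sketch, again on random data (variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 4
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Closed-form solution from the normal equations X^T y = X^T X w
w_star = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference: NumPy's least-squares solver minimizes ||y - Xw||^2 directly
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_star, w_lstsq))  # True
```

Note that in practice one solves the linear system (e.g. with `np.linalg.solve` or `lstsq`) rather than explicitly inverting $X^\top X$, for numerical stability; the explicit inverse is used here only to mirror the red formula.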