In the book Hands-On Machine Learning with Scikit-Learn & TensorFlow, the author only shows the partial-derivative formula used by the Batch Gradient Descent method:
$ \dfrac{\partial}{\partial \theta_{j}} MSE(\theta)= \dfrac{2}{m}\sum_{i=1}^{m}(\theta^T \cdot \boldsymbol{x}^{(i)}-y^{(i)})\cdot x_j^{(i)}$
So that the gradient vector of the cost function is: $\nabla_{\theta}MSE(\theta) = \begin{bmatrix} \dfrac{\partial}{\partial \theta_{0}} MSE(\theta) \\ \dfrac{\partial}{\partial \theta_{1}} MSE(\theta) \\ \dfrac{\partial}{\partial \theta_{2}} MSE(\theta) \\ \vdots \\ \dfrac{\partial}{\partial \theta_{n}} MSE(\theta) \end{bmatrix} = \dfrac{2}{m} \cdot X^T \cdot (X \cdot \theta - y)$
The MSE cost function is defined as: $MSE(\theta) = \dfrac{1}{m}\sum_{i=1}^{m}(\theta^T \cdot \boldsymbol{x}^{(i)}-y^{(i)})^2$
Could anyone kindly show me, step by step, how to prove the cost function's gradient vector formula above (using linear algebra)?
The cost function (writing $\boldsymbol{w}$ for $\theta$ and $N$ for $m$) is given by
$$J = \dfrac{1}{N}\sum_{n=1}^{N}\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]^2.$$
Take the total derivative
$$dJ = \dfrac{1}{N}\sum_{n=1}^N\{2\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]d\boldsymbol{w}^T\boldsymbol{x}_n \}.$$
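To spell out this step: $y_n$ is a constant, so the chain rule gives $d\left(\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]^2\right)=2\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]d\left(\boldsymbol{w}^T\boldsymbol{x}_n\right)$, and $d\left(\boldsymbol{w}^T\boldsymbol{x}_n\right)=d\boldsymbol{w}^T\boldsymbol{x}_n$ because $\boldsymbol{x}_n$ is held fixed.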
As $d\boldsymbol{w}^T$ does not depend on the summation index $n$, we can pull it out of the sum. We can put it in front of $\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]$ because that term is a scalar. Hence we obtain
$$dJ = d\boldsymbol{w}^T\left[\dfrac{1}{N}\sum_{n=1}^N\{2\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]\boldsymbol{x}_n \}\right].$$
Now, comparing with the relationship between the total derivative and the gradient (explained below), we can identify the term in brackets as the gradient of $J$ with respect to $\boldsymbol{w}$. Hence,
$$\text{grad}_{\boldsymbol{w}}J=\dfrac{1}{N}\sum_{n=1}^N\{2\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]\boldsymbol{x}_n \}.$$
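To recover the matrix form asked about in the question, stack the row vectors $\boldsymbol{x}_n^T$ as the rows of the design matrix $X$ and the targets $y_n$ into the vector $\boldsymbol{y}$, so that the $n$-th component of $X\boldsymbol{w}-\boldsymbol{y}$ is exactly the scalar $\boldsymbol{w}^T\boldsymbol{x}_n-y_n$. The sum above is then a matrix product:
$$\text{grad}_{\boldsymbol{w}}J=\dfrac{2}{N}\sum_{n=1}^N \boldsymbol{x}_n\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]=\dfrac{2}{N}X^T\left(X\boldsymbol{w}-\boldsymbol{y}\right),$$
which is the book's $\nabla_{\theta}MSE(\theta)=\dfrac{2}{m}X^T(X\theta-y)$ with $\boldsymbol{w}=\theta$ and $N=m$.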
Here is the explanation of the relationship between the gradient and the total derivative.
Let $J(\boldsymbol{w})=J(w_0,w_1,\ldots,w_m)$ be a multivariate function. The total derivative of $J$ is given by
$$dJ = \dfrac{\partial J}{\partial w_0}dw_0+\dfrac{\partial J}{\partial w_1}dw_1+\ldots+\dfrac{\partial J}{\partial w_m}dw_m$$ $$=\left[dw_0, dw_1,\ldots, dw_m\right]\left[\dfrac{\partial J}{\partial w_0},\dfrac{\partial J}{\partial w_1},\ldots,\dfrac{\partial J}{\partial w_m}\right]^T$$ $$=d\boldsymbol{w}^T\,\text{grad}_{\boldsymbol{w}}J.$$
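As a sanity check (not from the book), here is a minimal NumPy sketch that compares the closed-form gradient $\frac{2}{m}X^T(X\theta-y)$ against a central finite-difference approximation of the MSE; the function names (`mse`, `mse_gradient`, `numerical_gradient`) are my own.

```python
import numpy as np

def mse(theta, X, y):
    """MSE(theta) = (1/m) * sum_i (theta^T x_i - y_i)^2."""
    m = len(y)
    residual = X @ theta - y
    return residual @ residual / m

def mse_gradient(theta, X, y):
    """Closed-form gradient: (2/m) * X^T (X theta - y)."""
    m = len(y)
    return 2.0 / m * X.T @ (X @ theta - y)

def numerical_gradient(f, theta, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = eps
        grad[j] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
m, n = 50, 4
X = np.c_[np.ones(m), rng.normal(size=(m, n - 1))]  # bias feature x_0 = 1
y = rng.normal(size=m)
theta = rng.normal(size=n)

analytic = mse_gradient(theta, X, y)
numeric = numerical_gradient(lambda t: mse(t, X, y), theta)
print(np.max(np.abs(analytic - numeric)))  # should be on the order of 1e-8
```

The largest componentwise difference between the two gradients should be negligible, which confirms the formula derived above.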