I would like to ask about the steps of the following simplification. Can anyone show me how Step 1 simplifies to Step 2?
Step 1: $$\frac{\partial J_{\theta}}{\partial \theta}=\frac{\partial}{\partial \theta}\left[(X \theta-y)^{T}(X \theta-y)\right]$$ Step 2: $$\frac{\partial J_{\theta}}{\partial \theta}=2 X^{T} X \theta-2 X^{T} y$$
Here, $X$ is a matrix, and $\theta$, $y$ are vectors.
And is there any special meaning to a matrix multiplied by its transpose? Thank you.
I assume you're working under the conditions that $X\in\mathbb{R}^{m\times n}$, $\theta\in\mathbb{R}^n$, and $y\in\mathbb{R}^m$. Define $J(\theta) = \|X\theta-y\|_2^2 = (X\theta - y)^\top(X\theta-y)$. Expanding, we find that \begin{equation*} J(\theta) = (\theta^\top X^\top - y^\top)(X\theta-y) = \theta^\top X^\top X\theta - \theta^\top X^\top y - y^\top X\theta + y^\top y. \end{equation*} Since $\theta^\top X^\top y$ is a scalar, it equals its transpose, i.e., $\theta^\top X^\top y = y^\top X\theta$. Therefore, \begin{equation*} J(\theta) = \theta^\top X^\top X \theta - 2y^\top X\theta + y^\top y. \end{equation*}
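If it helps to see this numerically, here is a minimal NumPy sketch (the dimensions, seed, and random data are made up purely for illustration) checking that the compact form $(X\theta-y)^\top(X\theta-y)$ agrees with the expanded form above:

```python
import numpy as np

# Sanity check (my own sketch, not part of the derivation): verify that
# (X theta - y)^T (X theta - y) equals the expanded form
# theta^T X^T X theta - 2 y^T X theta + y^T y, which uses the fact that
# the scalar theta^T X^T y equals its transpose y^T X theta.
rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.standard_normal((m, n))
theta = rng.standard_normal(n)
y = rng.standard_normal(m)

r = X @ theta - y
compact = r @ r                                    # (X theta - y)^T (X theta - y)
expanded = theta @ X.T @ X @ theta - 2 * y @ X @ theta + y @ y
print(np.isclose(compact, expanded))               # True
```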
Now, the gradient of the linear function $g(\theta) = a^\top \theta$ is $\nabla g(\theta) = a$, since $\frac{\partial g}{\partial \theta_i}(\theta) = a_i$ for all $i\in\{1,2,\dots,n\}$. Once again taking partial derivatives, you find that the gradient of the quadratic function $h(\theta) = \theta^\top Q\theta$ is $\nabla h(\theta) = (Q+Q^\top)\theta$. In the context of your problem, these gradient formulas yield \begin{equation*} \nabla J(\theta) = (X^\top X + (X^\top X)^\top)\theta -2 X^\top y = 2X^\top X\theta - 2X^\top y. \end{equation*}
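To double-check the closed-form gradient, you can compare it against a central finite-difference approximation of $J$. The following is a rough NumPy sketch with arbitrary random data (again, my own illustration rather than part of the original problem):

```python
import numpy as np

# Finite-difference check of the gradient formula derived above:
# grad J(theta) = 2 X^T X theta - 2 X^T y.
rng = np.random.default_rng(1)
m, n = 6, 4
X = rng.standard_normal((m, n))
theta = rng.standard_normal(n)
y = rng.standard_normal(m)

def J(t):
    r = X @ t - y
    return r @ r

grad_closed = 2 * X.T @ X @ theta - 2 * X.T @ y

# Central differences along each coordinate direction e_i.
eps = 1e-6
grad_fd = np.array([
    (J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
    for e in np.eye(n)
])
print(np.allclose(grad_closed, grad_fd, atol=1e-4))  # True
```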
One interpretation of a matrix multiplied by its transpose comes from statistics and machine learning. In particular, suppose $\{y_1,y_2,\dots,y_m\}\subseteq\mathbb{R}^n$ represents a sample of $m$ data points. Center the data so that $x_i = y_i - \bar{y}$ where $\bar{y} = \frac{1}{m}\sum_{i=1}^m y_i$. Then we can form a data matrix $X\in\mathbb{R}^{m\times n}$ by stacking the centered data into rows: \begin{equation*} X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_m^\top \end{bmatrix}. \end{equation*} In this case, we have that \begin{equation*} \frac{1}{m}X^\top X = \frac{1}{m}\sum_{i=1}^m x_i x_i^\top \end{equation*} is the sample covariance matrix of the data (normalized by $m$; dividing by $m-1$ instead gives the unbiased estimator). Another interpretation of $X^\top X$ is that it is the Gram matrix generated by the columns of $X$.
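As a quick illustration (assuming NumPy; the data here are arbitrary), centering the rows of a data matrix and forming $\frac{1}{m}X^\top X$ reproduces what `np.cov` computes with `bias=True`, i.e., the version normalized by $m$ rather than $m-1$:

```python
import numpy as np

# Sketch of the covariance interpretation: after centering the rows of a
# data matrix, (1/m) X^T X matches numpy's covariance with bias=True.
rng = np.random.default_rng(2)
m, n = 100, 3
Y = rng.standard_normal((m, n))        # m samples, n features (rows = points)
X = Y - Y.mean(axis=0)                 # center each feature

cov_from_gram = (X.T @ X) / m
cov_numpy = np.cov(Y, rowvar=False, bias=True)
print(np.allclose(cov_from_gram, cov_numpy))  # True
```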