The basic setup in the multiple linear regression model is
\begin{align} Y &= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} \end{align}
\begin{align} X &= \begin{bmatrix} 1 & x_{11} & \dots & x_{1k}\\ 1 & x_{21} & \dots & x_{2k}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n1} & \dots & x_{nk} \end{bmatrix} \end{align}
\begin{align} \beta &= \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{k} \end{bmatrix} \end{align}
\begin{align} \epsilon &= \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \vdots \\ \epsilon_{n} \end{bmatrix} \end{align}
The regression model is $Y=X \beta + \epsilon$.
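For concreteness, here is a minimal numerical sketch of this setup in NumPy (the dimensions and coefficient values below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2  # n observations, k regressors (illustrative values)

# Design matrix X: a leading column of ones for the intercept beta_0,
# followed by the k regressor columns
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # shape (n, k+1)

beta = np.array([1.0, 2.0, -0.5])    # beta_0, beta_1, beta_2 (made up)
eps = rng.normal(scale=0.1, size=n)  # error vector epsilon

Y = X @ beta + eps                   # the model Y = X beta + epsilon
```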
To find the least squares estimator of the $\beta$ vector, we need to minimize $S(\beta)=\sum_{i=1}^n \epsilon_i^2 = \epsilon ' \epsilon = (Y-X\beta)'(Y-X\beta)=Y'Y-2\beta 'X'Y + \beta 'X'X \beta$ and then solve
$$\frac{\partial S(\beta)}{\partial \beta}=0$$
My question: how do we get $-2X'Y+2X'X \beta$?
The sum of the squared errors can be written as
\begin{align} \left\lVert \epsilon \right\rVert^2 &= \left\lVert Y - X\beta \right\rVert^2 \\ &= (Y - X\beta)^T(Y - X\beta) = Y^TY - \beta^TX^TY - Y^TX\beta + \beta^TX^TX\beta \\ &= \left\lVert Y \right\rVert^2 - 2Y^TX\beta + \left\lVert X\beta \right\rVert^2 \end{align}
The two middle terms combine because each is a scalar, so $\beta^TX^TY = (\beta^TX^TY)^T = Y^TX\beta$.
Then, taking the gradient $\frac{d\left\lVert \epsilon \right\rVert^2}{d\beta}$ gives
$$ \frac{d\left\lVert \epsilon \right\rVert^2}{d\beta} = -2X^TY + 2X^TX\beta $$
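As a quick sanity check on this formula, one can compare it against a finite-difference approximation of the gradient. A minimal sketch in NumPy, with made-up data and dimensions (nothing here comes from the question itself):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # design matrix with intercept column
Y = rng.normal(size=n)
beta = rng.normal(size=k + 1)

def S(b):
    """Sum of squared errors S(beta) = ||Y - X b||^2."""
    r = Y - X @ b
    return r @ r

# Closed-form gradient: -2 X'Y + 2 X'X beta
grad = -2 * X.T @ Y + 2 * X.T @ X @ beta

# Central finite differences, one coordinate of beta at a time
h = 1e-6
grad_fd = np.array([(S(beta + h * e) - S(beta - h * e)) / (2 * h)
                    for e in np.eye(k + 1)])

print(np.allclose(grad, grad_fd, atol=1e-4))  # prints True
```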
Reviewing that differentiation term by term (since differentiation is a linear operator!):
$\left\lVert Y \right\rVert^2$ does not depend on $\beta$, so its derivative is $0$.
$-2Y^TX\beta$ is a sum whose $j^{th}$ term is $-2\beta_jY^Tx_j$, where $x_j$ is the $j^{th}$ column of $X$ (spelled out below). Therefore, the gradient of this term evaluates to $-2X^TY$.
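Componentwise, with the columns indexed $j = 0, \dots, k$ to match $\beta$:
$$ -2Y^TX\beta = -2\sum_{j=0}^{k} \beta_j \, Y^Tx_j \qquad\Longrightarrow\qquad \frac{\partial}{\partial \beta_j}\left(-2Y^TX\beta\right) = -2\,Y^Tx_j = -2\,x_j^TY, $$
and stacking these partial derivatives into a vector gives exactly $-2X^TY$.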
$\left\lVert X\beta \right\rVert^2 = (X\beta)^T(X\beta)$ can be differentiated using the product rule, together with the following fact.
In general, the Jacobian of $f(x) = Ax$ is $J_f = A$, since each entry is $\partial f_i / \partial x_j = A_{ij}$.
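Spelling out that last step (a direct application of the fact above): with $f(\beta) = X\beta$ we have $J_f = X$, and the product rule $\nabla_\beta\!\left(u^Tv\right) = J_u^Tv + J_v^Tu$ applied with $u = v = f(\beta)$ gives
$$ \nabla_\beta \left\lVert X\beta \right\rVert^2 = X^T(X\beta) + X^T(X\beta) = 2X^TX\beta. $$
Summing the three terms recovers $-2X^TY + 2X^TX\beta$, and setting this gradient to zero yields the normal equations $X^TX\hat\beta = X^TY$.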