The derivation can be found on Wikipedia, but it's not clear how each step follows.
We have $y=X\beta+\epsilon$ and want to minimize $||\epsilon||^2$. We write the objective function as $S(\beta)=||y-X\beta||^2=y^Ty-y^TX\beta-\beta^TX^Ty+\beta^TX^TX\beta=y^Ty-2\beta^TX^Ty+\beta^TX^TX\beta$. The last equality holds because the two middle terms are scalars and each is the transpose of the other, so they are equal and can be combined. What I don't understand is how the derivative is taken: the derivation proceeds to take the partial derivative with respect to $\beta$, yielding $-X^Ty+X^TX\beta=0$.
In the last step, what happened to the $2$? And why did $\beta^T$ disappear while $\beta$ remained? My guess is that the derivative gives $-2X^Ty+2(X^TX)\beta=0$. But specifically, how does one take the partial derivative with respect to $\beta$ of $\beta^TX^TX\beta$?
By Eq. 69 in the Matrix Cookbook (p. 10)
$\frac{\partial}{\partial\beta}(\beta^TX^Ty) = X^Ty.$
By Eq. 81 (p. 11)
$\frac{\partial}{\partial\beta}(\beta^TX^TX\beta) = (X^TX + (X^TX)^T)\beta = 2X^TX\beta.$
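If you would rather see where Eq. 81 comes from than take it on faith, here is a sketch in components, writing $A = X^TX$ (the name $A$ and the indices $i, j, k$ are my own notation, not the Cookbook's):
$\beta^TA\beta = \sum_{i}\sum_{j}\beta_iA_{ij}\beta_j,$
$\frac{\partial}{\partial\beta_k}\sum_{i}\sum_{j}\beta_iA_{ij}\beta_j = \sum_{j}A_{kj}\beta_j + \sum_{i}\beta_iA_{ik} = \big((A + A^T)\beta\big)_k.$
Stacking the components over $k$ gives $(A + A^T)\beta$, and since $A = X^TX$ is symmetric this equals $2X^TX\beta$. It also shows why $\beta^T$ seems to disappear: the gradient is collected into a column vector, so both occurrences of $\beta$ in the quadratic form contribute a term that is written with $\beta$ on the right.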
So you are right, there is a factor of 2:
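$\frac{\partial}{\partial\beta}(y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta) = 0 - 2X^Ty + 2X^TX\beta.$
Setting this to zero and dividing by $2$ gives $-X^Ty + X^TX\beta = 0$, which is the equation in the question; the factor of $2$ simply cancels.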
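If you want to convince yourself numerically, a minimal sketch along the following lines compares the gradient $-2X^Ty + 2X^TX\beta$ with central finite differences of $S(\beta)$. Everything here is illustrative: the helper names (`matvec`, `transpose`, `S`, `analytic_gradient`, `numeric_gradient`), the random data, the sizes `n` and `p`, and the step `h` are my own assumptions, and it uses plain Python lists only so that it runs without extra libraries.

```python
import random

def matvec(A, v):
    """Matrix-vector product for a list-of-rows matrix A."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def transpose(A):
    """Transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*A)]

def S(X, y, beta):
    """Objective S(beta) = ||y - X beta||^2."""
    residual = [yi - pi for yi, pi in zip(y, matvec(X, beta))]
    return sum(r * r for r in residual)

def analytic_gradient(X, y, beta):
    """-2 X^T y + 2 X^T X beta, computed as -2 X^T (y - X beta)."""
    residual = [yi - pi for yi, pi in zip(y, matvec(X, beta))]
    return [-2.0 * g for g in matvec(transpose(X), residual)]

def numeric_gradient(X, y, beta, h=1e-6):
    """Central finite differences, one coordinate of beta at a time."""
    grad = []
    for k in range(len(beta)):
        up = beta[:]
        up[k] += h
        down = beta[:]
        down[k] -= h
        grad.append((S(X, y, up) - S(X, y, down)) / (2 * h))
    return grad

# Arbitrary small test problem (made-up data, fixed seed for repeatability).
random.seed(0)
n, p = 6, 3
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]
beta = [random.gauss(0, 1) for _ in range(p)]

# Largest componentwise disagreement between the two gradients.
max_gap = max(
    abs(a - b)
    for a, b in zip(analytic_gradient(X, y, beta), numeric_gradient(X, y, beta))
)
print(max_gap)
```

The printed gap should be tiny, limited only by floating-point rounding, because $S$ is quadratic in $\beta$ and central differences are exact for quadratics apart from rounding. If you drop the factor of 2 in `analytic_gradient`, the gap becomes large, which is a quick way to see that the 2 really belongs in the derivative.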