I was trying to follow along with the derivation of the normal equations, and almost everything here makes sense. However, when it came to the differentiation, I got stuck at how one goes from $$S(\beta) = y^Ty-2\beta^TX^Ty + \beta^TX^TX\beta$$
to
$$-X^Ty+(X^TX)\beta=0$$
just by computing $\frac{\partial S(\beta)}{\partial \beta} = 0$. I'm not very familiar with matrix calculus but I do want to understand how each step in this derivation works and this is the main part I'm getting stuck on. I speculate that one can cancel $\beta^T$ from both terms because it's equated to 0, but I can't see how the $2$ drops off.
Take a look here https://en.wikipedia.org/wiki/Matrix_calculus for details on matrix calculus. In a nutshell, $ \beta ' X'X \beta$ is a quadratic form with a symmetric matrix $X'X$, hence $$ \frac{\partial \beta ' X'X \beta}{\partial \beta} = 2 X'X \beta, $$ and $\beta ' X'y$ is of the form $\beta ' a$, where $a$ is $p \times 1$ vector, hence $$ \frac{\partial \beta ' X'y}{\partial \beta} = X'y. $$
To prove these formulas just expand explicitly the expressions and take partial derivatives w.r.t. to each $\beta_j$. After that, return to a matrix representation and you'll get the result.