In *Elements of Statistical Learning*, we differentiate $\mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta)$ (equation 2.4) w.r.t. $\beta$ to arrive at $X^T(y - X\beta) = 0$ (equation 2.5).
According to some answers (link), this is because $$(y - X\beta)^T(y - X\beta) = y^T y - 2\beta^T X^T y + \beta^T X^T X \beta.$$ I understand how differentiating from here leads to equation 2.5. What I don't understand is where the $-2\beta^T X^T y$ term comes from. Multiplying out $(y - X\beta)^T(y - X\beta)$, don't we get the two cross terms $-y^T X \beta$ and $-\beta^T X^T y$? Combining those into $-2\beta^T X^T y$ shouldn't be possible given that $X$ isn't guaranteed to be symmetric, right?
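As a quick sanity check on the quoted expansion, here is a minimal NumPy sketch (the dimensions and random data are illustrative, not from the book) comparing $(y - X\beta)^T(y - X\beta)$ against $y^T y - 2\beta^T X^T y + \beta^T X^T X \beta$ for a non-square, non-symmetric $X$:

```python
import numpy as np

# Illustrative sizes: N = 5 samples, p + 1 = 3 coefficients.
rng = np.random.default_rng(0)
N, p1 = 5, 3
X = rng.standard_normal((N, p1))   # N x (p+1): not square, so certainly not symmetric
y = rng.standard_normal(N)
beta = rng.standard_normal(p1)

r = y - X @ beta
lhs = r @ r                                                # (y - X beta)^T (y - X beta)
rhs = y @ y - 2 * beta @ X.T @ y + beta @ X.T @ X @ beta   # expanded form
print(np.isclose(lhs, rhs))  # True
```

The two sides agree numerically, so the expansion itself holds even without any symmetry assumption on $X$.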
Note that in this context $y$ is an $N \times 1$ column vector, $\beta$ is a $(p + 1) \times 1$ column vector, and $X$ is an $N \times (p+1)$ matrix, so $\beta^T X^T y$ is a $1 \times 1$ matrix, i.e. a scalar. A scalar equals its own transpose, so $$y^T X \beta = (\beta^T X^T y)^T = \beta^T X^T y$$ regardless of whether $X$ is symmetric, and the two cross terms combine into $-2\beta^T X^T y$.
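The scalar-transpose argument above, and the resulting gradient $-2X^T(y - X\beta)$, can both be checked numerically. A minimal sketch (random illustrative data, with the gradient verified against central finite differences):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p1 = 6, 4
X = rng.standard_normal((N, p1))
y = rng.standard_normal((N, 1))      # N x 1 column vector
beta = rng.standard_normal((p1, 1))  # (p+1) x 1 column vector

s = beta.T @ X.T @ y                 # 1 x 1 matrix, i.e. a scalar
# A 1x1 matrix equals its own transpose, so beta^T X^T y == y^T X beta.
assert np.allclose(s, s.T)
assert np.allclose(s, y.T @ X @ beta)

def rss(b):
    """RSS(b) = (y - Xb)^T (y - Xb), as in equation 2.4."""
    r = y - X @ b
    return float(r.T @ r)

# Closed-form gradient implied by the combined cross term.
grad = -2 * X.T @ (y - X @ beta)

# Central finite-difference approximation of the same gradient.
eps = 1e-6
num = np.zeros_like(beta)
for i in range(p1):
    e = np.zeros_like(beta)
    e[i] = eps
    num[i] = (rss(beta + e) - rss(beta - e)) / (2 * eps)

assert np.allclose(grad, num, atol=1e-4)
```

Both checks pass: the cross terms really are equal as scalars, and the gradient $-2X^T(y - X\beta)$ matches the numerical derivative, which is what setting equation 2.5 to zero relies on.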