I am deriving the vector form of the linear regression normal equation; my current working is below.
$$ L(\mathbf{\theta}) = \sum_{n=0}^N(y_n - \hat{y}_n)^2 = (\mathbf{y} - \hat{\mathbf{y}})^T(\mathbf{y} - \hat{\mathbf{y}}) = (\mathbf{y} - \mathbf{X}\mathbf{\theta})^T(\mathbf{y} - \mathbf{X}\mathbf{\theta}) $$ $$ L(\mathbf{\theta}) = (\mathbf{X}\mathbf{\theta})^T\mathbf{X}\mathbf{\theta} - (\mathbf{X}\mathbf{\theta})^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\mathbf{\theta} + \mathbf{y}^T\mathbf{y} = \mathbf{\theta}^T\mathbf{X}^T\mathbf{X}\mathbf{\theta} - \mathbf{\theta}^T\mathbf{X}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\mathbf{\theta} + \mathbf{y}^T\mathbf{y} $$
I then need to differentiate this with respect to $\mathbf{\theta}$. I believe the linear and constant terms differentiate as follows:
$$ \frac{\mathrm{d}}{\mathrm{d}\mathbf{\theta}}\left( - \mathbf{\theta}^T\mathbf{X}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\mathbf{\theta} + \mathbf{y}^T\mathbf{y} \right) = -\mathbf{X}^T\mathbf{y} - (\mathbf{y}^T\mathbf{X})^T + 0 = -2\mathbf{X}^T\mathbf{y} $$
However, I am not sure of the logic involved in differentiating the term $\mathbf{\theta}^T\mathbf{X}^T\mathbf{X}\mathbf{\theta}$. My thought is that you can commute the factor $\mathbf{\theta}^T$ past $\mathbf{X}^T\mathbf{X}$ to obtain $\mathbf{X}^T\mathbf{X}\mathbf{\theta}^T\mathbf{\theta}$ and then differentiate to get $2\mathbf{X}^T\mathbf{X}\mathbf{\theta}$, although I am not sure how to show that this is valid. How would you differentiate this term?
$ \def\t{\theta}\def\p{\partial} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} $You'd be better off doing the $\c{\rm differentiation}$ first and the substitution $(\hat y\to\t)$ last: $$\eqalign{ \hat y &= X\t \quad\qiq d\hat y = X\,d\t\\ w &= \hat y-y \qiq dw = d\hat y\\ \c{L} &\c{=} \c{w^Tw} \\ \c{dL} &\c{=} dw^Tw + w^Tdw \;\c{=}\; \c{2w^Tdw} \;= 2w^TX\,d\t \;= (2X^Tw)^T d\t \\ \grad{L}{\t} &= 2X^Tw \;= 2X^T(X\t-y) \\ }$$ (the two middle terms merge because $dw^Tw$ is a scalar, hence equal to its own transpose $w^Tdw$). Setting this gradient to zero recovers the normal equation $X^TX\t = X^Ty$.
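As a sanity check (my addition, not part of the answer above), the gradient $2X^T(X\theta - y)$ can be compared against central finite differences, and the resulting normal-equation solution against `numpy.linalg.lstsq`; the sizes and data below are arbitrary:

```python
import numpy as np

# Sketch: verify dL/dtheta = 2 X^T (X theta - y) numerically.
# Any full-rank X works; the seed just makes the run reproducible.
rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)
theta = rng.standard_normal(D)

def L(t):
    r = y - X @ t          # residual; L = ||y - X t||^2
    return r @ r

analytic = 2 * X.T @ (X @ theta - y)

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (L(theta + eps * e) - L(theta - eps * e)) / (2 * eps)
    for e in np.eye(D)
])
print(np.max(np.abs(analytic - numeric)))   # tiny: L is exactly quadratic

# Setting the gradient to zero gives X^T X theta = X^T y,
# whose solution matches the least-squares fit.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(theta_hat, np.linalg.lstsq(X, y, rcond=None)[0]))
```

Because $L$ is quadratic, the central-difference estimate has no truncation error, so the only discrepancy is floating-point rounding.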