I am trying to follow an example from linear algebra in which the authors calculate the derivative of an error function $E$ with respect to a weight vector $\mathbf{w}$. The function is:
$$E=\mathbf{w}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{w}$$
Here:
$\mathbf{X}\in\mathbb{R}^{n\times m}$
$\mathbf{w}\in\mathbb{R}^{m}$
Calculating the derivative with respect to $\mathbf{w}$, they obtain:
$$\frac{\partial E}{\partial \mathbf{w}}= 2\mathbf{w}^{T}\mathbf{X}^{T}\mathbf{X}$$
So because $E$ contains both $\mathbf{w}^{T}$ and $\mathbf{w}$, we are left with $2\mathbf{w}^{T}$, and hence, loosely speaking, we dropped $\mathbf{w}$. What I wonder is: is there a specific "vector derivative" rule that tells us to drop $\mathbf{w}$ rather than $\mathbf{w}^{T}$? In other words, why could we not obtain: $$\frac{\partial E}{\partial \mathbf{w}}= 2\mathbf{X}^{T}\mathbf{X}\mathbf{w}$$
So, is there a difference between the two results? That is, is there a specific, generally applicable vector-derivative rule, or is the choice arbitrary, with all that counts being that the dimensions match in the end (as they do in both cases here)? Thanks for any advice.
As suggested by @CWindolf, the two results are indeed transposes of each other. Under the convention that $\frac{\partial E}{\partial \mathbf{w}}$ has the same dimensions as $\mathbf{w}$, the derivative must be of dimension $m\times 1$. This holds only for $2\mathbf{X}^{T}\mathbf{X}\mathbf{w}$, or equivalently for $2(\mathbf{w}^{T}\mathbf{X}^{T}\mathbf{X})^{T}$, since $(AB)^{T}=B^{T}A^{T}$. Hence:
$(\mathbf{w}^{T}(\mathbf{X}^{T}\mathbf{X}))^{T}=(\mathbf{X}^{T}\mathbf{X})^{T}\mathbf{w}$
Applying $(AB)^{T}=B^{T}A^{T}$ once more to $(\mathbf{X}^{T}\mathbf{X})^{T}$ shows that $\mathbf{X}^{T}\mathbf{X}$ is symmetric, so $(\mathbf{X}^{T}\mathbf{X})^{T}=\mathbf{X}^{T}\mathbf{X}$. Putting everything together yields:
$2\mathbf{X}^{T}\mathbf{X}\mathbf{w}$
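As a sanity check, the column-vector result $2\mathbf{X}^{T}\mathbf{X}\mathbf{w}$ can be compared against a central finite-difference approximation of $E(\mathbf{w})=\mathbf{w}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{w}$. A minimal sketch in NumPy (the specific sizes and random data are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3
X = rng.standard_normal((n, m))   # X in R^{n x m}
w = rng.standard_normal(m)        # w in R^m

def E(w):
    # E(w) = w^T X^T X w (a scalar)
    return w @ X.T @ X @ w

# Analytic gradient: 2 X^T X w, a vector of the same shape as w.
analytic = 2 * X.T @ X @ w

# Central finite differences, one coordinate of w at a time.
eps = 1e-6
numeric = np.zeros(m)
for i in range(m):
    e = np.zeros(m)
    e[i] = eps
    numeric[i] = (E(w + e) - E(w - e)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be tiny
```

The two gradients agree to numerical precision, confirming that the $m\times 1$ form is the derivative under this convention; $2\mathbf{w}^{T}\mathbf{X}^{T}\mathbf{X}$ is simply its transpose (the row-vector layout of the same object).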