Assume $X$ is an $n \times d$ matrix and $\alpha$ is an $n \times 1$ vector. Then
$$\frac{d\|X^T\alpha\|^2_2}{d\alpha}=\frac{d\|X^T\alpha\|^2_2}{dX^T\alpha}\frac{dX^T\alpha}{d\alpha}=2\alpha^T X X^T.$$
I was under the impression that the derivative of the squared $L_2$ norm of a vector $\vec x$ is just $2\vec x$.
So why isn't $\dfrac{d\|X^{T}\alpha\|^{2}_{2}}{d X^{T}\alpha}$ just equal to $2(X^{T}\alpha)$?
I'm having difficulty figuring out why this expression is transposed.
Thanks for your help!
If $f:\mathbb R^n \to \mathbb R^m$ is differentiable at $x$ then $f'(x)$ is an $m \times n$ matrix. For example, if $f(x) = \|x\|^2$, then $f:\mathbb R^n \to \mathbb R$ and $f'(x) = 2 x^T$ is a $1 \times n$ matrix (a row vector). So, $f'(X^T \alpha) = 2 \alpha^T X$.
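To see where the row vector comes from, expand $f$ at $x$ to first order:
$$f(x+h) = (x+h)^T(x+h) = x^Tx + 2x^Th + h^Th = f(x) + 2x^Th + O(\|h\|^2),$$
so the derivative of $f$ at $x$ is the linear map $h \mapsto 2x^Th$, and the matrix representing that map is the $1 \times n$ row vector $2x^T$. Writing the gradient as the column vector $2x$ is the other common convention; the two are transposes of each other.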
If $g(\alpha) = f(X^T \alpha) = \| X^T \alpha \|^2$ then by the chain rule $$ g'(\alpha) = f'( X^T \alpha ) X^T = 2 \alpha^T X X^T. $$
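As a sanity check, the identity $g'(\alpha) = 2\alpha^T X X^T$ can be verified numerically against a central finite-difference approximation (a quick sketch with random data; the dimensions $n = 5$, $d = 3$ are arbitrary):

```python
import numpy as np

# Numerical check of g'(alpha) = 2 alpha^T X X^T for g(alpha) = ||X^T alpha||^2.
rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.standard_normal((n, d))      # X is n x d
alpha = rng.standard_normal(n)       # alpha is n x 1 (stored as a 1-D array)

def g(a):
    """g(a) = ||X^T a||^2."""
    return np.sum((X.T @ a) ** 2)

# Analytic derivative, as a row vector: 2 alpha^T X X^T.
analytic = 2 * alpha @ X @ X.T

# Central finite differences along each coordinate direction e_i.
eps = 1e-6
numeric = np.array([
    (g(alpha + eps * e) - g(alpha - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```

Since $g$ is quadratic in $\alpha$, the central difference agrees with the analytic derivative up to floating-point roundoff.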