Suppose we need to find the derivative $$ \dfrac{d\|(Xw)^T\|_2}{dw} $$ where $X$ is an $n \times m$ matrix and $w$ is an $m \times 1$ vector.
I know I need to apply the chain rule, but I am confused about how to handle the norm and the transpose together. Of course I can rewrite it as $$ \dfrac{d\|w^TX^T\|_2}{dw} $$ but then I get $$ \dfrac{d\|w^TX^T\|_2}{dw} = \dfrac{d(\|w^TX^T\|_2)}{d(w^TX^T)} \dfrac{d(w^TX^T)}{dw} = \dfrac{Xw}{\|w^TX^T\|_2} X^T$$
This is obviously wrong, because the product $XwX^T$ is not defined. Where am I going wrong?
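Restate the problem in terms of a convenient vector $y$, the function $\phi$, and its square:
$$\eqalign{
y &= Xw \\
\phi &= \|y^T\|_2 = \|y\|_2 \\
\phi^2 &= \|y\|^2_2 = y^Ty \\
}$$
The transpose does not change the Euclidean norm, so it can be dropped. Starting with the squared function, calculate the differential, then the gradient (assuming $Xw \ne 0$, so that $\phi > 0$):
$$\eqalign{
2\phi\,d\phi &= 2y^Tdy \;=\; 2y^TX\,dw \;=\; 2(X^Ty)^Tdw \\
d\phi &= \left(\frac{X^Ty}{\phi}\right)^Tdw \\
\frac{\partial\phi}{\partial w} &= \frac{X^Ty}{\phi} = \frac{X^TXw}{\|Xw\|_2} \\
}$$
As for where the attempt in the question goes wrong: the two factors are correct but multiplied in the wrong order. With gradients written as column vectors, the chain rule reads $\frac{\partial\phi}{\partial w} = \left(\frac{\partial y}{\partial w}\right)^T\frac{\partial\phi}{\partial y} = X^T\,\frac{y}{\|y\|_2}$, which matches the result above.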
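The closed form $\frac{X^TXw}{\|Xw\|_2}$ can be sanity-checked numerically. The following is a minimal sketch using NumPy, not part of the derivation itself: the helper names `phi`, `grad_phi`, and `numerical_grad`, the sizes $n=5$, $m=3$, and the random test data are all assumptions made for illustration. It compares the formula against a central finite-difference approximation.

```python
import numpy as np

def phi(X, w):
    # The function being differentiated: ||Xw||_2
    return np.linalg.norm(X @ w)

def grad_phi(X, w):
    # Closed-form gradient X^T X w / ||Xw||_2 (valid only when Xw != 0)
    y = X @ w
    return X.T @ y / np.linalg.norm(y)

def numerical_grad(f, w, eps=1e-6):
    # Central finite differences, one coordinate of w at a time
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return g

# Hypothetical test data: a random 5x3 matrix and a random 3-vector
rng = np.random.default_rng(0)
n, m = 5, 3
X = rng.standard_normal((n, m))
w = rng.standard_normal(m)

analytic = grad_phi(X, w)
numeric = numerical_grad(lambda v: phi(X, v), w)

# Reports whether the two gradients agree to within the tolerance
print(np.allclose(analytic, numeric, atol=1e-6))
```

Note that the gradient has the same shape as $w$ (length $m$), which is a quick way to catch the kind of ordering mistake in the question: $X^T(Xw)$ is $m \times 1$, whereas $(Xw)X^T$ is not a valid product.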