I am reading the paper on weight normalization by Salimans & Kingma. In this paper, the weight $w$ is reparameterized so that $g$ is the norm of $w$ and $v$ is its direction, and the gradient is split into two parts accordingly; the gradient with respect to $v$ is calculated as
$$\nabla_v L = \frac{g}{\|v\|} M \nabla_wL$$
where $ M := I - \frac{ww^T}{\|w\|^2} $
> Because $\Delta v \propto \nabla_v L$ (steepest ascent/descent), $\Delta v$ is necessarily orthogonal to the weight $w$, since $M$ projects it away when calculating $\nabla_v L$.
I don't quite get this sentence: why must $\Delta v$ 'necessarily' be orthogonal to the weight $w$? Does the projection matrix $M$ have some special form?
After going through the paper to understand the notation: the update rule for $v$ is
$$ v' = v + \Delta v $$
Since we do vanilla gradient descent (referred to as steepest descent in the paper) for optimization, and assuming the step size is $\eta$,
$$\Delta v = -\eta\,\nabla_vL = -\eta\,\frac{g}{\|v\|} M \nabla_w L$$
$\Delta v$ is orthogonal to the weight $w$ because $M$ is a projection matrix that projects onto the orthogonal complement of $w$. Mathematically,
$$ w^T\Delta v=-\eta\frac{g}{\|v\|}w^T\left( I - \frac{ww^T}{\|w\|^2}\right)\nabla_wL $$
and, using $w^Tw = \|w\|^2$,
$$ w^T\Delta v=-\eta\frac{g}{\|v\|}\left( w^T - \frac{\|w\|^2w^T}{\|w\|^2}\right)\nabla_wL=0 $$
Hence, $\Delta v$ is orthogonal to the original weight vector $w$.
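The argument above can be checked numerically. This is a small sketch of my own (not from the paper): it builds $M = I - ww^T/\|w\|^2$ for a random $w$, forms $\Delta v$ from an arbitrary gradient, and confirms both that $M$ is idempotent (i.e. a projection) and that $w^T\Delta v \approx 0$.

```python
import numpy as np

# Hypothetical example values; eta, g, and the vectors are arbitrary choices.
rng = np.random.default_rng(0)
n = 5
v = rng.normal(size=n)
g = 2.0                          # scalar norm parameter
w = g * v / np.linalg.norm(v)    # weight-normalized w = g * v / ||v||
grad_w = rng.normal(size=n)      # arbitrary gradient w.r.t. w

# Projection onto the orthogonal complement of w
M = np.eye(n) - np.outer(w, w) / np.dot(w, w)

eta = 0.1
delta_v = -eta * (g / np.linalg.norm(v)) * (M @ grad_w)

print(np.allclose(M @ M, M))   # M is idempotent, so it is a projection
print(np.dot(w, delta_v))      # ~0 up to floating-point error
```

The key step is exactly the algebra above: $w^T M = w^T - \frac{(w^Tw)w^T}{\|w\|^2} = 0$, so any vector multiplied by $M$ loses its component along $w$.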