I'm reading this paper for a university project, and I'm currently stuck on part B.2 of section 3, where the goal is to minimize $$ \frac{1}{N}\sum_ne^{-\rho y_nf(V;x_n)}, $$ with respect to the weights $V_k$ of layer $k$, where $f(V;x_n) = \sigma(V^K\sigma(V^{K-1}\dots\sigma(V^1x_n)))$ is the realization of a deep neural network on the observation $x_n$, which has label $y_n\in \{-1,1\}$, with ReLU activations $\sigma$, and where $V$ is the vector of weights $V_k$ for layer $k$, under the constraint $\|V_k\| = 1$, with $\|\cdot\|$ the $L_2$ matrix norm.
The part I'm stuck at is where the author claims that gradient descent on the lagrangian $$ \mathfrak{L} = \frac{1}{N}\sum_ne^{-\rho y_nf(V;x_n)}+\sum_k\lambda_k \|V_k\|^2 $$ yields the dynamical system $$ \dot{V_k} = \rho\frac{1}{N}\sum_ne^{-\rho y_nf(V;x_n)}y_n\left(\frac{\partial f(V;x_n)}{\partial V_k}-V_kf(V;x_n)\right). $$
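For reference, here is the raw gradient-flow equation that I believe the claim starts from, before any value of $\lambda_k$ is substituted (I've left the norm-gradient term unexpanded, since that is exactly the part I'm unsure about):
$$ \dot{V_k} = -\frac{\partial \mathfrak{L}}{\partial V_k} = \rho\frac{1}{N}\sum_ne^{-\rho y_nf(V;x_n)}y_n\frac{\partial f(V;x_n)}{\partial V_k} - \lambda_k\frac{\partial \|V_k\|^2}{\partial V_k}. $$
If this is the right starting point, then matching it against the author's dynamical system would pin down both the norm gradient and the value of $\lambda_k$.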
The justification given is that this holds because $\lambda_k = \frac{1}{2}\rho\frac{1}{N}\sum_ne^{-\rho y_nf(V;x_n)}y_nf(V;x_n)$, since $V_k^{T}\dot{V_k} = 0$, which in turn follows from $\|V_k\|^2 = 1$.
My problem is that I don't see why $\lambda_k$ should take that value. I tried to compute the gradients myself, but realized I'm not sure what the gradient of the squared matrix norm $\|V_k\|^2$ with respect to $V_k$ should be. I would very much appreciate advice on how to proceed, or a clarification of where I'm confused.