In a physical application, I'm varying the $40$ elements of a $10\times 4$ real matrix numerically. Every iteration, I'm using Gram-Schmidt orthogonalization to maintain orthonormality between the rows.
Of course, I've realized that double work is being done, keeping track of and varying $40$ elements and then discarding some of them for orthonormality, not to speak of what this actually does to the concept of gradient descent. Since there is one equation governing the first row, two governing the second and so on, there are actually $M\cdot N - \frac{N^2-N}{2} = 30$ degrees of freedom.
I tried to go the route through geometry, storing $30$ angles $\theta_l$ and expressing the rows as orthogonal points on the unit $40$-sphere. This led me to two things: Group theory, of which I know nothing, and a headache, of which I know more than I want.
I'm (apparently) looking for the first $N$ rows of an $\mathrm{SO}(M)$ matrix. Is there a reliable way to generate this matrix from $M\cdot N - \frac{N^2-N}{2}$ parameters?
It sounds like what you really want is gradient descent on Stiefel manifolds. Given $X,T\in\mathbb R^{N\times M}$ with $T$ tangent to $X$ in the Stiefel manifold, i.e. $XT^t$ antisymmetric, then $X\exp(-\gamma(X^tT-T^tX)/2)$ is a point on the Stiefel manifold in the direction of $-T.$ There is a cheaper approximation that still stays on the Stiefel manifold but avoids matrix exponentials - see http://noodle.med.yale.edu/hdtag/notes/steifel_notes.pdf
The tangent space $\{T\mid XT^t+TX^t=0\}$ is a 30-dimensional vector space; you could in principle compute a basis for any particular $X.$ That's probably not useful though.