I have the following loss function.
$$||\theta - (X^T X)^{-1} X^T y||_2^2$$
Here $X$ is a matrix, and $\theta$ and $y$ are known vectors.
I have another constraint for $X$, which is $X = f(\lambda)$ for some function $f$ that I didn't include here.
The idea is that I want to initialize $\lambda$ to some random vector, compute $X = f(\lambda)$, and then use gradient descent or some other iterative method to minimize the loss above by updating $\lambda$ at each step. However, I am having trouble deriving the gradient of this loss with respect to $\lambda$ for use in such an iterative algorithm.
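To make the setup concrete, here is a minimal sketch of the loop I have in mind. The $f$ below is just a made-up linear map from $\lambda$ to $X$, and the finite-difference gradient is a placeholder for the analytic gradient I am asking about:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 6, 3, 4

# made-up f: a linear map lambda -> X, only to make the loop runnable
B = rng.standard_normal((k, n, m))
def f(lam):
    return np.einsum('k,knm->nm', lam, B)

y = rng.standard_normal(n)
theta = rng.standard_normal(m)

def phi(lam):
    X = f(lam)
    M = np.linalg.solve(X.T @ X, X.T)   # (X^T X)^{-1} X^T
    p = M @ y - theta
    return p @ p                        # ||theta - (X^T X)^{-1} X^T y||^2

lam = rng.standard_normal(k)
lr, eps = 1e-3, 1e-6
losses = [phi(lam)]
for _ in range(100):
    # finite-difference gradient: the part I want to replace with an analytic one
    g = np.array([(phi(lam + eps * e) - phi(lam - eps * e)) / (2 * eps)
                  for e in np.eye(k)])
    lam -= lr * g
    losses.append(phi(lam))
```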
How would I do this?
Define some new variables $$\eqalign{ M &= (X^TX)^{-1}X^T \cr p &= My - \theta \cr }$$ and their differentials $$\eqalign{ dM &= (X^TX)^{-1}\,dX^T - (X^TX)^{-1}\,d(X^TX)\,(X^TX)^{-1}X^T \cr &= (X^TX)^{-1}\,dX^T - (X^TX)^{-1}\,dX^T\,XM - M\,dX\,M \cr dp &= dM\,y \cr }$$ Write the cost function in terms of these new variables.
Then find its differential and gradient (note the factor of $2$ arising from $d(p:p) = 2\,p:dp$). $$\eqalign{ \phi &= p:p \cr\cr d\phi &= 2\,p:dp = 2\,p:dM\,y \cr &= 2\,py^T:dM \cr &= 2\,py^T:(X^TX)^{-1}\,dX^T - 2\,py^T:(X^TX)^{-1}\,dX^T\,XM - 2\,py^T:M\,dX\,M \cr &= 2\,(X^TX)^{-1}py^T:dX^T - 2\,(X^TX)^{-1}py^TM^TX^T:dX^T - 2\,M^Tpy^TM^T:dX \cr &= 2\Big(yp^T(X^TX)^{-1} - XMyp^T(X^TX)^{-1} - M^Tpy^TM^T\Big):dX \cr &= 2\Big(yp^T(X^TX)^{-1} - XMyp^T(X^TX)^{-1} - M^Tpy^TM^T\Big):\frac{\partial X}{\partial\lambda_k}\,d\lambda_k \cr\cr \frac{\partial\phi}{\partial \lambda_k} &= 2\Big(yp^T(X^TX)^{-1} - XMyp^T(X^TX)^{-1} - M^Tpy^TM^T\Big):\frac{\partial X}{\partial\lambda_k} \cr\cr }$$ The colon is a convenient product notation for the trace, i.e. $\,\,A:B={\rm Tr}(A^TB)$.
Rules for rearranging terms in a colon product follow from the cyclic and transpose properties of the trace, e.g. $$\eqalign{ A:B &= B:A = A^T:B^T \cr A:BC &= AC^T:B = B^TA:C \cr }$$
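As a sanity check, here is a NumPy sketch of the gradient matrix $G$ defined by $d\phi = G:dX$ (with the factor of $2$ from differentiating $\phi = p:p$, which is easy to drop). The simplest verification treats the entries of $X$ themselves as the parameters, so the analytic $G$ should match the entrywise finite-difference gradient of $\phi$:

```python
import numpy as np

def phi(X, y, theta):
    """Loss ||theta - (X^T X)^{-1} X^T y||^2."""
    M = np.linalg.solve(X.T @ X, X.T)
    p = M @ y - theta
    return p @ p

def grad_X(X, y, theta):
    """G such that dphi = G : dX, i.e.
    G = 2*(y p^T A^{-1} - X M y p^T A^{-1} - M^T p (M y)^T) with A = X^T X."""
    A_inv = np.linalg.inv(X.T @ X)
    M = A_inv @ X.T
    p = M @ y - theta
    return 2 * (np.outer(y, p) @ A_inv
                - X @ M @ np.outer(y, p) @ A_inv
                - np.outer(M.T @ p, M @ y))

# finite-difference check on a random instance
rng = np.random.default_rng(0)
n, m = 5, 3
X = rng.standard_normal((n, m))
y = rng.standard_normal(n)
theta = rng.standard_normal(m)

G = grad_X(X, y, theta)
G_num = np.zeros_like(X)
eps = 1e-6
for i in range(n):
    for j in range(m):
        E = np.zeros_like(X)
        E[i, j] = eps
        G_num[i, j] = (phi(X + E, y, theta) - phi(X - E, y, theta)) / (2 * eps)

print(np.max(np.abs(G - G_num)))   # small, up to finite-difference error
```

Once $G$ agrees with the numerical gradient, the chain rule $\partial\phi/\partial\lambda_k = G:\partial X/\partial\lambda_k$ reduces to the elementwise sum `np.sum(G * dX_dlam_k)` for whatever $f$ you are using.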