I want to minimize the function $E = \sum_i \|Y_i-f(Z_i)\|^2$ w.r.t. $T$, where $f(Z)=ZTV^t$, $Y_i$ and $Z_i$ denote the $i$-th rows of $Y$ and $Z$, and all quantities are matrices (not necessarily square). Here $T$ acts as a weight matrix, as in regression. $Y \in R^{n \times p}$, $Z \in R^{n \times k}$, $T \in R^{k \times k}$, and $V \in R^{p \times k}$, where $p,n,k$ are arbitrary integers ($>1$).
How can I do this?
From the point of view of a gradient-based algorithm, do you need to compute $\partial E/\partial T_{a,b}$, where $T_{a,b}$ is an element of $T$? If so, how can you do that, and after computing it, do you just simultaneously perform the update $T_{a,b} \leftarrow T_{a,b} - \alpha \, \partial E/\partial T_{a,b}$ for all $a,b$, with learning rate $\alpha$, as in the usual gradient descent?
($A^t$ denotes the transpose of matrix $A$)
Let's use a colon to denote the trace/Frobenius product $$A:B={\rm tr}(A^TB)$$ and let's define the additional matrix variables $$\eqalign{ \def\L{\left} \def\R{\right} \def\qiq{\quad\implies\quad} F &= ZTV^T &\qiq dF = Z\,dT\,V^T \\ M &= F-Y &\qiq dM = dF \\ }$$ Write the function in terms of these new variables, then find its differential and gradient: $$\eqalign{ {\mathcal E} &= M:M \\\\ d{\mathcal E} &= 2M:dM \\ &= 2M:Z\,dT\,V^T \\ &= 2Z^TMV:dT \\ \\ \frac{\partial{\mathcal E}}{\partial T} &= 2Z^TMV \\ &= 2Z^T \L(F-Y\R) V \\ }$$ Set the gradient to zero and solve, assuming $Z$ and $V$ have full column rank so that $Z^TZ$ and $V^TV$ are invertible: $$\eqalign{ Z^TYV &= Z^TFV = Z^T \L(ZTV^T\R) V = \L(Z^TZ\R) T \L(V^TV\R) \\ \\ T &= \L(Z^TZ\R)^{-1} \L(Z^TYV\R) \L(V^TV\R)^{-1} \\ &= Z^+Y\L(V^+\R)^T \\ }$$ where $A^+$ denotes the Moore–Penrose pseudoinverse of $A$.
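As a numerical sanity check (a sketch, not part of the derivation), here is a small NumPy script with arbitrary sizes $n=7$, $p=5$, $k=3$. It compares the gradient $2Z^T(F-Y)V$ entrywise against central finite differences, verifies that the closed-form $T=Z^+Y(V^+)^T$ has vanishing gradient, and confirms that the plain elementwise update $T_{a,b} \leftarrow T_{a,b} - \alpha\,\partial E/\partial T_{a,b}$ from the question converges to the same matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 7, 5, 3            # arbitrary sizes with n, p >= k
Y = rng.standard_normal((n, p))
Z = rng.standard_normal((n, k))
V = rng.standard_normal((p, k))

def loss(T):
    M = Z @ T @ V.T - Y      # M = F - Y
    return np.sum(M * M)     # E = M : M  (squared Frobenius norm)

def grad(T):
    return 2 * Z.T @ (Z @ T @ V.T - Y) @ V

# 1) Check the gradient entrywise against central finite differences.
T0 = rng.standard_normal((k, k))
G = grad(T0)
num = np.empty_like(T0)
eps = 1e-6
for a in range(k):
    for b in range(k):
        Tp, Tm = T0.copy(), T0.copy()
        Tp[a, b] += eps
        Tm[a, b] -= eps
        num[a, b] = (loss(Tp) - loss(Tm)) / (2 * eps)

# 2) Closed-form minimizer T* = Z^+ Y (V^+)^T.
T_star = np.linalg.pinv(Z) @ Y @ np.linalg.pinv(V).T

# 3) Plain gradient descent, T <- T - alpha * grad(T).
#    Since E is quadratic in T, any step alpha < 2/L converges, where
#    L = 2 * sigma_max(Z)^2 * sigma_max(V)^2 bounds the Hessian norm.
alpha = 1.0 / (2 * np.linalg.norm(Z, 2) ** 2 * np.linalg.norm(V, 2) ** 2)
T = np.zeros((k, k))
for _ in range(50000):
    T -= alpha * grad(T)
```

Because $E$ is quadratic in $T$, the central difference is exact up to roundoff, and gradient descent with $\alpha = 1/L$ converges linearly to $T^\*$ whenever $Z$ and $V$ have full column rank (which holds almost surely for the random matrices above).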