I have been working on a problem for a review homework for a class on Machine Learning. The problem statement is: minimize $\sum_{j=0}^n \|W^T x_j - y_j\|_{2}^{2} + \lambda \|W\|_F^2$ with respect to $W$.
I have made some progress toward setting the gradient to zero and solving for a closed-form solution. I have found that $\nabla_{W} \lambda \|W\|_F^2 = 2\lambda W$, and I have worked out that: $\sum_{j=0}^n \|W^T x_j - y_j\|_2^2 = \sum_{j=0}^n (W^T x_j - y_j)^T (W^T x_j - y_j) = \sum_{j=0}^n (x_j^T W - y_j^T)(W^T x_j - y_j) = \sum_{j=0}^n (x_j^T WW^T x_j - x_j^T Wy_j - y_j^T W^T x_j + y_j^T y_j)$
But I can't seem to make any progress beyond that last expression. Obviously the $y_j^T y_j$ term vanishes when taking the gradient with respect to $W$, but beyond that I'm stuck. Could anyone point me in the right direction? Thanks!
Instead of indexed vectors, write the function in terms of the matrix $X$ whose columns are the $x_j$ vectors. Likewise, the vectors $y_j$ are the columns of the matrix $Y$.
For further convenience, define a new matrix variable $$\eqalign{ Z &= X^TW-Y^T \cr }$$ Write the function in terms of the Frobenius inner product (denoted by a colon, $A:B = {\rm Tr}(A^TB)$) and these new variables. Then it is a snap to find the differential and gradient. $$\eqalign{ f &= Z:Z + \lambda W:W \cr\cr df &= 2Z:dZ + 2\lambda W:dW \cr &= 2Z:X^TdW + 2\lambda W:dW \cr &= 2(XZ + \lambda W):dW \cr &= 2(XX^TW - XY^T + \lambda W):dW \cr\cr \frac{\partial f}{\partial W} &= 2(XX^TW - XY^T + \lambda W) \cr\cr }$$ where the step from the second to the third line uses the property $A:BC = B^TA:C$. Setting the gradient to zero yields $$\eqalign{ (XX^T + \lambda I)\,W &= XY^T \cr W &= (XX^T + \lambda I)^{-1}XY^T \cr }$$ Note that $XX^T + \lambda I$ is always invertible for $\lambda > 0$, since $XX^T$ is positive semidefinite.
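If it helps, here is a small numerical sanity check of the closed-form solution, written as a NumPy sketch (the dimensions, variable names, and random data below are illustrative assumptions, not from the original problem):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 4, 3, 50            # input dim, output dim, number of samples
lam = 0.1                     # the regularization weight lambda

X = rng.standard_normal((d, n))   # columns are the x_j
Y = rng.standard_normal((m, n))   # columns are the y_j

# Closed-form solution: W = (X X^T + lam I)^{-1} X Y^T
W = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ Y.T)

def f(W):
    """The objective: sum_j ||W^T x_j - y_j||^2 + lam ||W||_F^2."""
    Z = X.T @ W - Y.T
    return np.sum(Z**2) + lam * np.sum(W**2)

# Gradient from the derivation above; should vanish (to machine precision) at W
grad = 2 * (X @ X.T @ W - X @ Y.T + lam * W)
print(np.abs(grad).max())

# Perturbing W in random directions should never decrease f
for _ in range(5):
    D = rng.standard_normal(W.shape)
    assert f(W + 1e-3 * D) >= f(W)
```

The gradient check confirms the stationarity condition, and the perturbation check is consistent with $W$ being the minimizer (which follows from convexity of $f$).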