I am looking at a paper for non-negative matrix factorization and can't seem to figure out the derivation for the gradient. The function is as follows:
$f(W,H) = \frac{1}{2}||V-WH ||^2_F$
Where V is fixed.
The resulting gradients are:
$\nabla_W f(W,H) = (WH-V)H^T $
$\nabla_H f(W,H) = W^T(WH-V) $
I tried looking up rules in various reference manuals but never got anything solid. Even a hint / solid resource would be incredibly helpful.
I find it's often easiest to compute matrix-derivatives in these terms, then translate to whatever the desired means of expressing the derivative is. That is, I prefer to think of $\nabla_W f$ as a map from $\Bbb R^{m \times n}$ to the set of linear functionals on $\Bbb R^{m \times n}$.
We find that $$ \DeclareMathOperator{\tr}{tr} f(W + \delta W, H) = \frac 12 \tr[(V - (W + \delta W)H)(V - (W + \delta W)H)^T] =\\ \frac 12 \tr[[(V - WH)- \delta WH][(V - WH) - \delta WH]^T] =\\ f(W,H) + \tr[(WH - V)(\delta W H)^T] + o(\|\delta W\|) =\\ f(W,H) + \tr[[(WH - V)H^T](\delta W)^T] + o(\|\delta W\|) $$ Thus, I would say that $$ [D_W f(W,H)](\delta W) = \tr[[(WH - V)H^T]^T(\delta W)] $$ Translating this to your matrix calculus in which $\nabla_W$ should be a matrix, we find that $$ \nabla _W f(W,H) = (WH - V)H^T $$ as desired.