How does one derive, in detail, the multiplicative update rules for Nonnegative Matrix Factorization?
Minimize $\left \| V - WH \right \|^2$ with respect to $W$ and $H$, subject to the constraints $W,H \geq 0.$ The multiplicative update rules are as follows:
\begin{equation} W_{ij} \leftarrow W_{ij} \frac{(VH^T)_{ij}}{(WHH^T)_{ij}} \end{equation}
\begin{equation} H_{ij} \leftarrow H_{ij} \frac{(W^TV)_{ij}}{(W^TWH)_{ij}} \end{equation}
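For concreteness, the two update rules can be sketched directly in NumPy (the function name, the random initialization, and the small `eps` guarding against division by zero are my additions, not part of the rules themselves):

```python
import numpy as np

def nmf_multiplicative(V, r, n_iter=200, eps=1e-10, seed=0):
    """Factor a nonnegative V (m x n) as W @ H, with W (m x r) and H (r x n),
    using the Lee-Seung multiplicative update rules."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), elementwise
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # W <- W * (V H^T) / (W H H^T), elementwise
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Because both updates multiply a nonnegative iterate by a nonnegative ratio, $W$ and $H$ stay nonnegative automatically, with no projection step needed.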
Introducing Lagrange multiplier matrices $\Psi$ and $\Phi$ for the constraints $W \geq 0$ and $H \geq 0$, the Lagrangian $\mathcal{L}$ is: $$\mathcal{L}(W,H) =\left \| V - WH \right \|^2-\operatorname{Tr}(\Psi W^T)-\operatorname{Tr}(\Phi H^T)$$
Taking the gradients with respect to $W$ and $H$:
$$\nabla_W \mathcal{L}(W,H) = -2VH^T + 2WHH^T-\Psi$$ $$\nabla_H \mathcal{L}(W,H) = -2W^TV + 2W^TWH- \Phi$$
Setting these gradients to zero gives $\Psi = 2WHH^T - 2VH^T$ and $\Phi = 2W^TWH - 2W^TV$. Substituting into the KKT complementary-slackness conditions $\Psi_{ij}W_{ij}=0$ and $\Phi_{ij}H_{ij}=0$ yields:
$$(-2VH^T + 2WHH^T)\circ W=0$$ $$(-2W^TV + 2W^TWH)\circ H=0$$
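(A step worth making explicit: the first of these conditions reads, elementwise, $(WHH^T)_{ij}W_{ij} = (VH^T)_{ij}W_{ij}$. Wherever $W_{ij}>0$ this is equivalent to
$$W_{ij} = W_{ij}\,\frac{(VH^T)_{ij}}{(WHH^T)_{ij}},$$
so the multiplicative rule can be read as a fixed-point iteration on the stationarity condition, and likewise for $H$.)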
The question is: why are the resulting update rules \begin{equation} W_{ij} \leftarrow W_{ij} \frac{(VH^T)_{ij}}{(WHH^T)_{ij}} \end{equation} \begin{equation} H_{ij} \leftarrow H_{ij} \frac{(W^TV)_{ij}}{(W^TWH)_{ij}} \end{equation}
and not the reciprocal versions \begin{equation} W_{ij} \leftarrow W_{ij} \frac{(WHH^T)_{ij} }{(VH^T)_{ij}} \end{equation} \begin{equation} H_{ij} \leftarrow H_{ij} \frac{(W^TWH)_{ij}}{(W^TV)_{ij}} \end{equation}
In addition, why should the learning rates be chosen as $\eta_W = \frac{W}{WHH^T}$ and $\eta_H = \frac{H}{W^TWH}$, as in the referenced paper?
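(For reference, this elementwise learning rate is precisely what collapses an additive gradient step into the multiplicative rule. With the factor of $2$ from the gradient $\nabla_H \left\|V-WH\right\|^2 = 2W^TWH - 2W^TV$ absorbed, i.e. $\eta_H = \frac{H}{2\,W^TWH}$ elementwise,
$$H \;\leftarrow\; H - \eta_H \circ \left(2W^TWH - 2W^TV\right) = H - H + H\circ\frac{W^TV}{W^TWH} = H \circ \frac{W^TV}{W^TWH},$$
and the same computation with $\eta_W = \frac{W}{2\,WHH^T}$ recovers the rule for $W$.)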
Note that near a solution point, the Hadamard (elementwise) fraction is approximately the all-ones matrix: $$\left(\frac{W^TV}{W^TWH}\right) \approx \mathbf{1}$$ If the elements of $H$ are too big, the elements of the Hadamard fraction fall below unity (since $H$ appears in the denominator); if they are too small, the fractional elements rise above unity.
This self-correcting behavior is exactly what a convergent iterative method requires: $$H_+ = H \circ\left(\frac{W^TV}{W^TWH}\right)$$ On the other hand, the reciprocal fraction $$\left(\frac{W^TWH}{W^TV}\right) \approx \mathbf{1}$$ has the opposite behavior: large elements in $H$ produce a Hadamard fraction with elements greater than unity, leading to even larger elements in $H$ at the next iteration, and eventually to divergence.
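This self-correcting versus runaway behavior is easy to check numerically. The sketch below (all names are mine; $W$ is held fixed so only the $H$-update is compared) runs both fractions from the same starting point:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((8, 3))       # held fixed; only H is updated
H_true = rng.random((3, 6))
V = W @ H_true               # an exact nonnegative factorization exists

H0 = rng.random((3, 6)) + 0.1  # shared positive starting point

def run(update, n_iter=50):
    """Apply an H-update rule repeatedly, recording the residual norms."""
    H = H0.copy()
    errs = [np.linalg.norm(V - W @ H)]
    for _ in range(n_iter):
        H = update(H)
        errs.append(np.linalg.norm(V - W @ H))
    return errs

# Correct rule: H <- H * (W^T V) / (W^T W H)
correct = run(lambda H: H * (W.T @ V) / (W.T @ W @ H))
# Reciprocal rule: H <- H * (W^T W H) / (W^T V)
flipped = run(lambda H: H * (W.T @ W @ H) / (W.T @ V))
```

In runs like this, `correct` shrinks the residual steadily while `flipped` grows it without bound, matching the stability argument above.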