I am trying to analytically and numerically compute the derivative of the following function
$$ J(W) = \frac{1}{2}\|W^T R W - I\|_F^2 $$
From a paper that I am reading, the derivative of this function
$$ Q(W) = \frac{1}{2}\|W^T W - I\|_F^2 $$
with respect to $W$ is
$$ \frac{\partial Q}{\partial W} = W(W^TW-I) $$
(This is wrong, by the way: a constant factor of 2 should multiply the whole expression.) Going back to my problem, according to my calculations,
$$ \frac{\partial J}{\partial W} = R^TW(W^TW-I) $$
(of course, also in this case there should be a 2 multiplying the whole expression).
To check the correctness of these derivatives, I am using software that computes symbolic derivatives, with $W$ filled with random values. Computing $\frac{\partial Q}{\partial W}$ with the analytical expression above and with the symbolic differentiation software, I get (almost) the same results (elementwise differences are approximately $10^{-15}$).
For $\frac{\partial J}{\partial W}$, however, my analytical expression and the software give two different results.
To convince myself, I treated each gradient matrix as an image and plotted the results (figure: Analytical vs Theano).
Visually the two results look alike, but elementwise they differ substantially. For example, I took the entry at location [50,50] and this is what I get
- Analytical expression: 1290.2355448
- Theano symbolic deriv: 1213.77161213
I don't understand whether:
- I am wrong (if so, please help me fix the expression for my derivative)
- Theano is computing something different (?)
- It is only a problem of numerical approximation (possible, but unlikely, since in the first case the same data gives matrices that agree to about $10^{-15}$)
Thanks to anyone that can give me any support
Define a new matrix variable
$$A=W^TRW-I$$
Then write the function in terms of the inner/Frobenius product (denoted by a colon, $A:B=\operatorname{tr}(A^TB)$) and this new variable. In this form, the differential and gradient are simple to calculate
$$\eqalign{ J &= \frac{1}{2}A:A \cr dJ &= A:dA \cr &= A:(dW^TRW+W^TR\,dW) \cr &= AW^TR^T:dW^T + R^TWA:dW \cr &= (RWA^T+R^TWA):dW \cr \frac{\partial J}{\partial W} &= RWA^T+R^TWA \cr &= RW(W^TRW-I)^T + R^TW(W^TRW-I) }$$
The fourth line uses $A:BC = AC^T:B = B^TA:C$, and the fifth uses $X:Y = X^T:Y^T$.
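The result above can be checked numerically without any symbolic software. The following is a minimal sketch, assuming NumPy and small random matrices; the helper names `J`, `grad_J` and `numerical_grad`, the sizes `n` and `k`, and the step `eps` are illustrative choices, not anything taken from the question. It compares the analytical gradient with a central finite-difference approximation, using a deliberately non-symmetric $R$.

```python
import numpy as np

def J(W, R):
    # J(W) = 1/2 * ||W^T R W - I||_F^2
    A = W.T @ R @ W - np.eye(W.shape[1])
    return 0.5 * np.sum(A * A)

def grad_J(W, R):
    # Analytical gradient from the derivation: R W A^T + R^T W A
    A = W.T @ R @ W - np.eye(W.shape[1])
    return R @ W @ A.T + R.T @ W @ A

def numerical_grad(f, W, eps=1e-6):
    # Central finite differences, one entry of W at a time
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            E = np.zeros_like(W)
            E[i, j] = eps
            G[i, j] = (f(W + E) - f(W - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
n, k = 6, 4                        # illustrative sizes (assumption)
W = rng.standard_normal((n, k))
R = rng.standard_normal((n, n))    # deliberately non-symmetric

G_analytic = grad_J(W, R)
G_numeric = numerical_grad(lambda X: J(X, R), W)

# Relative error between the two gradients; should be small
rel_err = np.linalg.norm(G_analytic - G_numeric) / np.linalg.norm(G_numeric)
print("relative error:", rel_err)
```

For $R=I$ the matrix $A$ is symmetric and the formula reduces to $2W(W^TW-I)$, which is consistent with the factor of 2 noted in the question. If $R$ is symmetric, the general expression likewise simplifies to $2RW(W^TRW-I)$.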