My dearest Exchangers,
Let me summarize two questions that I have. I need to compute a matrix gradient of the scalar function $$ J =\|A\hat x_k - x_{k+1}\|_2^2, $$ with respect to the 2-by-2 matrix $A$. First, what is the correct way to compute this gradient? I came up with the following nonsense: $$\begin{align} \begin{bmatrix}\frac{\partial J}{\partial a_1} & \frac{\partial J}{\partial a_2} \\ \frac{\partial J}{\partial a_3} & \frac{\partial J}{\partial a_4}\end{bmatrix} &= \begin{bmatrix} (\begin{bmatrix} a_1 & a_2 \end{bmatrix}-x_{k+1,1})\hat x_{k,1} & (\begin{bmatrix}a_1 & a_2\end{bmatrix}-x_{k+1,2})\hat x_{k,2} \\ (\begin{bmatrix}a_3 & a_4\end{bmatrix}-x_{k+1,1})\hat x_{k,1} & (\begin{bmatrix}a_3 & a_4\end{bmatrix}-x_{k+1,2})\hat x_{k,2} \end{bmatrix} \\ & = \begin{bmatrix} A \hat x_k - \begin{bmatrix}x_{k+1,1} \\ x_{k+1,1}\end{bmatrix} & A \hat x_k - \begin{bmatrix}x_{k+1,2} \\x_{k+1,2}\end{bmatrix}\end{bmatrix} \hat x_k \end{align} $$
But (assuming this is even correct), I don't know how to simplify the expression further so that it uses only the symbols that appear in $J$. As it stands, I would need to stack some elements into vectors manually; once the dimensions become large, doing that with a for loop inside an optimization is clumsy and inelegant. It would be nicer if the gradient could be written using only the variables present in $J$.
My other question is: once I have found a gradient that I believe is correct, how can I verify it? The only way I currently know is to build the objective, use the computed gradient to update the parameters, and check whether $J$ indeed decreases, but that is probably not a sufficient test. Mainly, though, I would already be happy to know how to compute the gradient correctly! (Perhaps there are standard rules for objectives involving matrices and vectors that I am not aware of.)
By the way, I hope it is apparent which symbols are elements of which variables; otherwise, please ask.
For typing convenience, let me use the symbols $$\eqalign{ x &= {\hat x}_k \cr y &= x_{k+1} \cr w &= Ax-y \cr }$$ I will also use a colon to represent the trace/Frobenius product, i.e. $$\eqalign{A:B &= {\rm tr}(A^TB)}$$ Now I can write the cost function in terms of these symbols and find its differential and gradient $$\eqalign{ J &= \|w\|^2 = w:w \cr dJ &= 2w:dw = 2w:dA\,x = 2wx^T:dA \cr \frac{\partial J}{\partial A} &= 2wx^T = 2(Ax-y)x^T\cr }$$ To test the correctness of these expressions, first calculate $J$ for some value of $A$. Then add a small random perturbation to the matrix (let's call it $dA$) and calculate the change in the cost function $$\eqalign{ J_0 &= \|Ax-y\|^2 \cr J_1 &= \|(A+dA)x-y\|^2 \cr \Delta J &= J_1-J_0 \cr }$$ Compare this to the change predicted by the differential $$\eqalign{ dJ &= 2(Ax-y)x^T:dA \cr &= 2(Ax-y)^T\,dA\,x \cr }$$ You should find that $\,\,dJ \approx \Delta J$
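To make the gradient formula and the perturbation check concrete, here is a small NumPy sketch. The variable names follow the substitutions above ($x = \hat x_k$, $y = x_{k+1}$); the random seed, problem data, and perturbation scale of $10^{-6}$ are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random 2x2 instance: x plays the role of x_hat_k, y of x_{k+1}
A = rng.standard_normal((2, 2))
x = rng.standard_normal(2)
y = rng.standard_normal(2)

def J(A):
    """Cost J = ||Ax - y||^2."""
    w = A @ x - y
    return w @ w

# Analytic gradient from the derivation: dJ/dA = 2 (Ax - y) x^T
grad = 2.0 * np.outer(A @ x - y, x)

# Perturbation test: add a small random dA and compare the actual
# change in J against the change predicted by the differential
dA = 1e-6 * rng.standard_normal((2, 2))
delta_J = J(A + dA) - J(A)        # actual change
dJ_pred = np.tensordot(grad, dA)  # Frobenius product grad : dA

# Since J is quadratic in A, the mismatch is O(||dA||^2)
print(abs(delta_J - dJ_pred) / abs(delta_J))  # tiny relative error
```

Since the cost is quadratic in $A$, the finite difference $\Delta J$ and the predicted $dJ$ agree up to a term of order $\|dA\|^2$, so shrinking the perturbation should shrink the relative error accordingly.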