I come from a programming background. I am familiar with scalar calculus but not so much with vector/matrix calculus.
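I am trying to understand stochastic gradient descent for multiple linear regression, and I need to understand how, from this step:
$f(W) := \|XW-Y\|_F^2 = \operatorname{tr}\left((XW-Y)^\top(XW-Y)\right) = \operatorname{tr}\left(W^\top X^\top XW - Y^\top XW - W^\top X^\top Y + Y^\top Y\right)$
the following gradient is obtained: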
$\nabla_{\mathrm W} f (\mathrm W) = 2 \, \mathrm X^{\top} \mathrm X \mathrm W - 2 \, \mathrm X^{\top} \mathrm Y = 2 \, \mathrm X^{\top} \left( \mathrm X \mathrm W - \mathrm Y \right)$
where tr denotes the trace, X is the design matrix, W is the coefficient matrix, f(W) is the cost function and Y is the matrix of target labels.
I also know there is another approach to such problems using the Frobenius inner product, but I have absolutely no idea what Frobenius inner products are.
So I would just like clear steps showing how the derivation follows, along with any rules/laws that are used.
The Frobenius product is just a convenient notation for the trace $$A:BC = {\rm tr}(A^TBC)$$ The product distributes over addition $$A:(B+C) = A:B + A:C$$ and the cyclic and transpose properties of the trace allow a product to be rearranged in various ways, e.g. $$\eqalign{ A:BC &= BC:A \cr &= A^T:(BC)^T \cr &= B^TA:C \cr &= AC^T:B \cr }$$ Define the matrix $A=(XW-Y)$ whose full differential is $$\eqalign{dA &= dX\,W + X\,dW - dY \cr}$$ However, in the current problem, $(X,Y)$ don't change, so their differentials $(dX,dY)$ are zero, leaving $$\eqalign{dA &= X\,dW \cr}$$
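Since you come from a programming background, the rearrangement rules above can be spot-checked numerically. This is only a rough sketch using NumPy; the helper name `frob`, the random matrices, and their shapes are illustrative assumptions, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

def frob(P, Q):
    """Frobenius product P:Q = tr(P^T Q) (hypothetical helper name)."""
    return np.trace(P.T @ Q)

# Shapes are chosen so that B @ C has the same shape as A.
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 3))

lhs = frob(A, B @ C)

# Each entry is one of the rearrangements of A:BC listed above.
checks = {
    "BC:A": frob(B @ C, A),
    "A^T:(BC)^T": frob(A.T, (B @ C).T),
    "B^T A:C": frob(B.T @ A, C),
    "A C^T:B": frob(A @ C.T, B),
    # The Frobenius product is also the sum of elementwise products.
    "sum(A * BC)": np.sum(A * (B @ C)),
}

all_match = all(np.isclose(lhs, value) for value in checks.values())
print(all_match)
```

If every rearrangement is valid, all of the values agree with `A:BC` up to floating-point rounding.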
One last point concerns the differential of various products.
The differential of the Frobenius product is $$\eqalign{d\,(A:B)&=dA:B + A:dB \cr&= B:dA + A:dB}$$ Similarly, the differential of a normal matrix product is $$\eqalign{d\,(AB)&=dA\,B + A\,dB}$$ However, this product cannot be rearranged because it is not commutative.
The differential of a Kronecker product is $$\eqalign{d\,(A\otimes B)&=dA\otimes B + A\otimes dB}$$ The differential of a Hadamard product is $$\eqalign{d\,(A\odot B)&=dA\odot B + A\odot dB\cr&=B\odot dA + A\odot dB }$$ The Hadamard product is commutative, and can be rearranged like the Frobenius product.
Now write the cost function in terms of the new variable.
Then calculate its differential and gradient. $$\eqalign{ f &= A:A \cr df &= 2A:dA = 2A:X\,dW = 2X^TA:dW \cr \frac{\partial f}{\partial W} &= 2X^TA = 2X^T(XW-Y) \cr }$$
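The final formula can also be sanity-checked against a finite-difference approximation. The following is a minimal sketch assuming NumPy; the matrix sizes, the step size `eps`, and the function names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4))   # design matrix (6 samples, 4 features)
W = rng.standard_normal((4, 2))   # coefficient matrix
Y = rng.standard_normal((6, 2))   # targets

def f(W):
    """Cost f = A:A = ||XW - Y||_F^2 with A = XW - Y."""
    A = X @ W - Y
    return np.sum(A * A)

def grad(W):
    """Gradient from the derivation: 2 X^T (XW - Y)."""
    return 2 * X.T @ (X @ W - Y)

def numerical_grad(f, W, eps=1e-6):
    """Central finite differences, perturbing one entry of W at a time."""
    G = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        E = np.zeros_like(W)
        E[idx] = eps
        G[idx] = (f(W + E) - f(W - E)) / (2 * eps)
    return G

max_err = np.max(np.abs(grad(W) - numerical_grad(f, W)))
print(max_err)  # should be tiny compared with the gradient entries
```

Because $f$ is quadratic in $W$, the central difference is exact apart from rounding error, so the two gradients should agree very closely. For stochastic gradient descent, the same formula would be evaluated on a subset of the rows of $X$ and $Y$ at each step.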