How to calculate $\frac{\partial\Theta}{\partial L}$ if I know $\frac{\partial L}{\partial\Theta}$?



Suppose I have a halved sum of squared errors loss:

$$L(\Theta)=\frac{1}{2}\sum^{M}(y-h(X\circ\Theta))^2$$

with constant inputs $X\in\mathbb{R}^{M\times in}$, parameters $\Theta\in\mathbb{R}^{in\times out}$, and ground truth $y$ and hypothesis output $h(z)$, both in $\mathbb{R}^{M\times out}$.
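To make the setup concrete, here is a minimal NumPy sketch of this loss, reading $X\circ\Theta$ as the matrix product $X\Theta$ (the only interpretation under which the stated shapes compose). The shapes, the random data, and the choice of `tanh` as the hypothesis function are all placeholders:

```python
import numpy as np

# Hypothetical small shapes: M samples, n_in features, n_out outputs.
M, n_in, n_out = 5, 3, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(M, n_in))          # constant inputs, (M, in)
Theta = rng.normal(size=(n_in, n_out))  # parameters, (in, out)
Y = rng.normal(size=(M, n_out))         # ground truth, (M, out)

h = np.tanh  # placeholder elementwise hypothesis function

# Halved sum of squared errors over all M * out entries
L = 0.5 * np.sum((Y - h(X @ Theta)) ** 2)
```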

Then according to this article, $\frac{\partial L}{\partial\Theta}$ can be computed by multiplying all partial derivatives in the path between $L(\Theta)$ and $\Theta$.

So, if I give a name to each intermediate computation and write down the partial derivative of each with respect to its input:

$$f=\frac{1}{2}e\texttt{ and } \frac{\partial f}{\partial e}=\frac{1}{2}$$

$$e=\sum^M d\texttt{ and } \frac{\partial e}{\partial d}=1$$

$$d=c^2\texttt{ and } \frac{\partial d}{\partial c}=2c$$

$$c=y-b\texttt{ and } \frac{\partial c}{\partial b}=-1$$

$$b=h(a)\texttt{ and } \frac{\partial b}{\partial a}=h'(a)$$

$$a=X\circ \Theta\texttt{ and } \frac{\partial a}{\partial \Theta}=X$$

(Note: $\frac{\partial}{\partial Y}X\circ Y=X$ from Matrix Cookbook rule (38))

Then

$$\frac{\partial}{\partial\Theta}L(\Theta)=\frac{\partial f}{\partial e}\frac{\partial e}{\partial d}\frac{\partial d}{\partial c}\frac{\partial c}{\partial b}\frac{\partial b}{\partial a}\frac{\partial a}{\partial \Theta}$$

Substituting, I get:

$$\frac{\partial}{\partial\Theta}L(\Theta)=c\circ-h'(a)\circ X$$

Which is:

$$\frac{\partial}{\partial\Theta}L(\Theta)=(y-h(X\circ\Theta))\circ -h'(X\circ\Theta)\circ X$$

Now, after adding some transpositions and multiplying everything out, I get a shape of $\frac{\partial}{\partial\Theta}L(\Theta)\in\mathbb{R}^{M\times in}$, which is different from the shape of $\Theta\in\mathbb{R}^{in\times out}$. But everything I can find online tells me that, in order to perform a weight update with Gradient Descent, the shape of $\frac{\partial}{\partial\Theta}L(\Theta)$ should be the same as that of $\Theta$.

What am I doing wrong? I think I actually want $\frac{\partial\Theta}{\partial L}$, but I'm not sure that even makes sense at all. I'd guess it has to do with the fact that I treat $\Theta$ as a single value, which it is not.
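For context, the shape requirement comes from the Gradient Descent update itself, which subtracts the (scaled) gradient elementwise from the parameters; a toy sketch with hypothetical shapes and a made-up gradient:

```python
import numpy as np

n_in, n_out = 3, 2
Theta = np.zeros((n_in, n_out))  # parameters, shape (in, out)
grad = np.ones((n_in, n_out))    # dL/dTheta, must have the same shape
lr = 0.1                         # hypothetical learning rate

# The update is elementwise, so it only type-checks when the
# gradient has exactly the shape of Theta.
Theta = Theta - lr * grad
```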

BEST ANSWER

OK, if I understand correctly what you are asking, then $L(\Theta)$ should be written as $$L(\Theta)=\sum_{1\leq i\leq M, 1\leq j\leq \text{out}}(Y-h(X\Theta))^2_{i,j}=||Y-h(X\Theta)||^2_F=(Y-h(X\Theta))\cdot(Y-h(X\Theta))$$ where $L(\Theta)\in R$; $Y,h(X\Theta)\in R^{M\times\text{out}}$; $X\in R^{M\times\text{in}}$; $\Theta\in R^{\text{in}\times\text{out}}$; $||\cdot||_F$ is the Frobenius matrix norm; and $\cdot$ is the Frobenius inner product, i.e. $A\cdot B=\sum_{i,j}A_{i,j}B_{i,j}$. We will use some rules of matrix differential calculus, which can be found, for example, in Practical Guide to Matrix Calculus for Deep Learning - Andrew Delong.
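A quick numerical check that the three expressions for $L(\Theta)$ agree (entrywise sum of squares, squared Frobenius norm, inner product of the residual with itself); the data and the choice of `tanh` for $h$ are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
M, n_in, n_out = 4, 3, 2
X = rng.normal(size=(M, n_in))
Theta = rng.normal(size=(n_in, n_out))
Y = rng.normal(size=(M, n_out))
R = Y - np.tanh(X @ Theta)              # residual, shape (M, out)

L_sum = np.sum(R ** 2)                  # entrywise sum of squares
L_frob = np.linalg.norm(R, 'fro') ** 2  # squared Frobenius norm
L_inner = np.sum(R * R)                 # Frobenius inner product <R, R>
```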

In particular, we will need rule (6) from the paper - $A\cdot(BC)=B^TA\cdot C$, the product rule (13) for the inner product $\cdot$ - $d(A\cdot B)=dA\cdot B+A\cdot dB$, the product rule (12) for the ordinary matrix product - $d(AB)=(dA)B+A\,dB$, and the fact $(*)$ that the differential of a matrix is the matrix of differentials, i.e. $(df(A))_{i,j}=df_{i,j}(A)$ for $f(X)\in R^{m\times n}$. So $$dL(\Theta)=d[(Y-h(X\Theta))\cdot(Y-h(X\Theta))]=\\d(Y-h(X\Theta))\cdot(Y-h(X\Theta))+(Y-h(X\Theta))\cdot d(Y-h(X\Theta))=\\-dh(X\Theta)\cdot(Y-h(X\Theta))+(Y-h(X\Theta))\cdot(-dh(X\Theta))=\\-2(Y-h(X\Theta))\cdot dh(X\Theta)$$ So we have $dL(\Theta)=-2(Y-h(X\Theta))\cdot dh(X\Theta)$. Now, if there are no further restrictions on the form of the function $h$, then using $(*)$ and (6) we can only do something like this: $$dL(\Theta)=-2(Y-h(X\Theta))\cdot dh(X\Theta)=\sum_{1\leq i\leq M, 1\leq j\leq \text{out}}(2h(X\Theta)-2Y)_{i,j}dh_{i,j}(X\Theta)=\\\sum_{1\leq i\leq M, 1\leq j\leq \text{out}}(2h(X\Theta)-2Y)_{i,j}h'_{i,j}(X\Theta)\cdot d(X\Theta)=\\\sum_{1\leq i\leq M, 1\leq j\leq \text{out}}(2h(X\Theta)-2Y)_{i,j}h'_{i,j}(X\Theta)\cdot X\,d\Theta=\\\sum_{1\leq i\leq M, 1\leq j\leq \text{out}}(2h(X\Theta)-2Y)_{i,j}X^Th'_{i,j}(X\Theta)\cdot d\Theta=\\ \Big[\sum_{1\leq i\leq M, 1\leq j\leq \text{out}}(2h(X\Theta)-2Y)_{i,j}X^Th'_{i,j}(X\Theta)\Big]\cdot d\Theta$$ where each $dh_{i,j}\in R$, and consequently each derivative $h'_{i,j}\in R^{M\times\text{out}}$, i.e. it has the dimensions of $X\Theta$. Note that the differential is now exactly of the form (17) in the paper, i.e. $dL(\Theta)=D\cdot d\Theta$, where $D$ is the sum of matrices shown in the previous derivation.
This means that the needed derivative is exactly $$\frac{\partial}{\partial\Theta}L(\Theta)=D=\sum_{1\leq i\leq M, 1\leq j\leq \text{out}}(2h(X\Theta)-2Y)_{i,j}X^Th'_{i,j}(X\Theta)$$ Note also that $(2h(X\Theta)-2Y)_{i,j}\in R,\ X^T\in R^{\text{in}\times M},\ h'_{i,j}\in R^{M\times\text{out}}$, so $(2h(X\Theta)-2Y)_{i,j}X^Th'_{i,j}(X\Theta)\in R^{\text{in}\times\text{out}}$ and thus their sum $D\in R^{\text{in}\times\text{out}}$ has exactly the same dimensions as $\Theta$.
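This result can be sanity-checked numerically for the special case of an elementwise $h$ (here $\tanh$, a hypothetical choice). In that case each $h'_{i,j}(X\Theta)$ has a single nonzero entry, and the sum $D$ collapses to $X^T[(2h(X\Theta)-2Y)\odot h'(X\Theta)]$, which we can compare against finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
M, n_in, n_out = 4, 3, 2
X = rng.normal(size=(M, n_in))
Theta = rng.normal(size=(n_in, n_out))
Y = rng.normal(size=(M, n_out))

def loss(T):
    # L(Theta) = ||Y - h(X Theta)||_F^2 with h = tanh (elementwise)
    return np.sum((Y - np.tanh(X @ T)) ** 2)

# Closed form: for elementwise h the sum D collapses to
# X^T [(2 h(X Theta) - 2 Y) * h'(X Theta)], shape (in, out).
Z = X @ Theta
D = X.T @ ((2 * np.tanh(Z) - 2 * Y) * (1 - np.tanh(Z) ** 2))

# Central finite-difference approximation of dL/dTheta
eps = 1e-6
D_num = np.zeros_like(Theta)
for i in range(n_in):
    for j in range(n_out):
        E = np.zeros_like(Theta)
        E[i, j] = eps
        D_num[i, j] = (loss(Theta + E) - loss(Theta - E)) / (2 * eps)
```

Note that `D` has shape `(n_in, n_out)`, matching `Theta`, exactly as the derivation requires.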